Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project will include many more features and has not been pre-cleaned. You are also free to choose whatever approach you'd like to analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [264]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error, silhouette_score, roc_auc_score
from sklearn.preprocessing import Imputer, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import operator
from sklearn.cluster import KMeans
import gc
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb


# magic word for producing visualizations in notebook
%matplotlib inline

Part 0: Get to Know the Data

There are four data files associated with this project:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the .csv data files in this project that they're semicolon (;) delimited, so an additional argument in the read_csv() call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

We start by loading the general population and the customers data

In [265]:
# gen_pop = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')
# customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')
gen_pop = pd.read_csv('gen_pop.csv')
customers = pd.read_csv('customers.csv')
//anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning: Columns (19,20) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

One of the first steps that we're carrying out is to take a look at the top 5 rows of both datasets. On a quick look, they seem to have almost the same features. They have significantly different number of rows however. We will look on how the values of these features differ.

In [300]:
if 'Unnamed: 0' in gen_pop.columns:
    gen_pop.drop(['Unnamed: 0'], axis=1, inplace= True)
if 'Unnamed: 0' in customers.columns:
    customers.drop(['Unnamed: 0'], axis=1, inplace= True)
In [3]:
gen_pop.head()
Out[3]:
Unnamed: 0 LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ... VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB
0 0 910215 -1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 3 1 2
1 1 910220 -1 9.0 0.0 NaN NaN NaN NaN 21.0 ... 4.0 8.0 11.0 10.0 3.0 9.0 4.0 5 2 1
2 2 910225 -1 9.0 17.0 NaN NaN NaN NaN 17.0 ... 2.0 9.0 9.0 6.0 3.0 9.0 2.0 5 2 3
3 3 910226 2 1.0 13.0 NaN NaN NaN NaN 13.0 ... 0.0 7.0 10.0 11.0 NaN 9.0 7.0 3 2 4
4 4 910241 -1 1.0 20.0 NaN NaN NaN NaN 14.0 ... 2.0 3.0 5.0 4.0 2.0 9.0 3.0 4 1 3

5 rows × 367 columns

In [301]:
customers.head()
Out[301]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ... VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP PRODUCT_GROUP CUSTOMER_GROUP ONLINE_PURCHASE ANREDE_KZ ALTERSKATEGORIE_GROB
0 9626 2 1.0 10.0 NaN NaN NaN NaN 10.0 1.0 ... 2.0 6.0 9.0 7.0 3 COSMETIC_AND_FOOD MULTI_BUYER 0 1 4
1 9628 -1 9.0 11.0 NaN NaN NaN NaN NaN NaN ... 3.0 0.0 9.0 NaN 3 FOOD SINGLE_BUYER 0 1 4
2 143872 -1 1.0 6.0 NaN NaN NaN NaN 0.0 1.0 ... 11.0 6.0 9.0 2.0 3 COSMETIC_AND_FOOD MULTI_BUYER 0 2 4
3 143873 1 1.0 8.0 NaN NaN NaN NaN 8.0 0.0 ... 2.0 NaN 9.0 7.0 1 COSMETIC MULTI_BUYER 0 1 4
4 143874 -1 1.0 20.0 NaN NaN NaN NaN 14.0 7.0 ... 4.0 2.0 9.0 3.0 1 FOOD MULTI_BUYER 0 1 3

5 rows × 369 columns

Here we investigate the size of the two datasets and we can see that the general population dataset is significantly larger with a size of 2.7 GB for its 891221 rows as oppose to the 613.7 MB for the 191652 rows of the customers dataset. We will attempt to tackle this size issue by setting the columns to the right type in the preprocessing section.

Another point that is addressed below is the features present in the customer dataset but not in the general population. These are: 'ONLINE_PURCHASE', 'CUSTOMER_GROUP', 'PRODUCT_GROUP'

In [9]:
# Be sure to add in a lot more cells (both markdown and code) to document your
# approach and findings!
gen_pop.info(memory_usage='deep')
print('\n')
customers.info(memory_usage='deep')
print('\n number of columns in general population:', len(gen_pop.columns))
print('\n number of columns in customers:', len(customers.columns), '\n')
print(set(customers.columns)-set(gen_pop.columns))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(93), object(6)
memory usage: 2.7 GB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191652 entries, 0 to 191651
Columns: 369 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(94), object(8)
memory usage: 613.7 MB

 number of columns in general population: 366

 number of columns in customers: 369 

{'ONLINE_PURCHASE', 'CUSTOMER_GROUP', 'PRODUCT_GROUP'}

In the two cells below we look at different attributes for each feature such as the mean, the min and max values. The difference between the two datasets is not easily discerned using the table form and given the large number of features. Therefore we will attempt to visualize these differences.

In [10]:
customers.describe()
Out[10]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ... VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ONLINE_PURCHASE ANREDE_KZ ALTERSKATEGORIE_GROB
count 191652.000000 191652.000000 145056.000000 145056.000000 11766.000000 5100.000000 1275.000000 236.000000 139810.000000 141725.000000 ... 143781.000000 143781.000000 143781.000000 137910.000000 145056.000000 141725.000000 191652.000000 191652.000000 191652.000000 191652.000000
mean 95826.500000 0.344359 1.747525 11.352009 12.337243 13.672353 14.647059 15.377119 10.331579 4.965863 ... 4.374417 4.564769 3.168868 4.152716 8.646371 3.723133 2.576806 0.090247 1.376432 3.060907
std 55325.311233 1.391672 1.966334 6.275026 4.006050 3.243335 2.753787 2.307653 4.134828 14.309694 ... 2.924355 2.887035 2.233516 1.974375 1.154001 2.095540 1.168486 0.286536 0.484492 1.086254
min 1.000000 -1.000000 1.000000 0.000000 2.000000 2.000000 5.000000 8.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000
25% 47913.750000 -1.000000 1.000000 8.000000 9.000000 11.000000 13.000000 14.000000 9.000000 1.000000 ... 2.000000 2.000000 1.000000 2.000000 9.000000 2.000000 1.000000 0.000000 1.000000 3.000000
50% 95826.500000 0.000000 1.000000 11.000000 13.000000 14.000000 15.000000 16.000000 10.000000 1.000000 ... 4.000000 4.000000 3.000000 5.000000 9.000000 3.000000 3.000000 0.000000 1.000000 3.000000
75% 143739.250000 2.000000 1.000000 16.000000 16.000000 16.000000 17.000000 17.000000 13.000000 4.000000 ... 7.000000 7.000000 4.000000 6.000000 9.000000 5.000000 3.000000 0.000000 2.000000 4.000000
max 191652.000000 3.000000 9.000000 21.000000 18.000000 18.000000 18.000000 18.000000 25.000000 523.000000 ... 11.000000 13.000000 11.000000 6.000000 9.000000 8.000000 6.000000 1.000000 2.000000 9.000000

8 rows × 361 columns

In [11]:
gen_pop.describe()
Out[11]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ... VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB
count 8.912210e+05 891221.000000 817722.000000 817722.000000 81058.000000 29499.000000 6170.000000 1205.000000 628274.000000 798073.000000 ... 770025.000000 815304.000000 815304.000000 815304.000000 783619.000000 817722.000000 798073.000000 891221.000000 891221.000000 891221.000000
mean 6.372630e+05 -0.358435 4.421928 10.864126 11.745392 13.402658 14.476013 15.089627 13.700717 8.287263 ... 2.417322 6.001214 7.532130 5.945972 3.933406 7.908791 4.052836 3.362438 1.522098 2.777398
std 2.572735e+05 1.198724 3.638805 7.639683 4.097660 3.243300 2.712427 2.452932 5.079849 15.628087 ... 1.166572 2.856091 3.247789 2.771464 1.964701 1.923137 1.949539 1.352704 0.499512 1.068775
min 1.916530e+05 -1.000000 1.000000 0.000000 2.000000 2.000000 4.000000 7.000000 0.000000 0.000000 ... 0.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000
25% 4.144580e+05 -1.000000 1.000000 0.000000 8.000000 11.000000 13.000000 14.000000 11.000000 1.000000 ... 2.000000 3.000000 5.000000 4.000000 2.000000 8.000000 3.000000 3.000000 1.000000 2.000000
50% 6.372630e+05 -1.000000 3.000000 13.000000 12.000000 14.000000 15.000000 15.000000 14.000000 4.000000 ... 2.000000 6.000000 8.000000 6.000000 4.000000 9.000000 3.000000 3.000000 2.000000 3.000000
75% 8.600680e+05 -1.000000 9.000000 17.000000 15.000000 16.000000 17.000000 17.000000 17.000000 9.000000 ... 3.000000 9.000000 10.000000 8.000000 6.000000 9.000000 5.000000 4.000000 2.000000 4.000000
max 1.082873e+06 3.000000 9.000000 21.000000 18.000000 18.000000 18.000000 18.000000 25.000000 595.000000 ... 4.000000 11.000000 13.000000 11.000000 6.000000 9.000000 8.000000 6.000000 2.000000 9.000000

8 rows × 360 columns

The two cells below highlights the top columns in terms of size. These seem to be of an object or categorical type

In [12]:
(gen_pop.memory_usage(deep=True) * 1e-6).sort_values(ascending= False)[:10]
Out[12]:
EINGEFUEGT_AM               63.634284
OST_WEST_KZ                 55.653554
D19_LETZTER_KAUF_BRANCHE    53.480219
CAMEO_DEU_2015              49.909606
CAMEO_DEUG_2015             44.137244
CAMEO_INTL_2015             40.921738
KBA05_ANTG1                  7.129768
INNENSTADT                   7.129768
KBA05_ALTER1                 7.129768
KBA05_ALTER2                 7.129768
dtype: float64
In [13]:
(customers.memory_usage(deep=True) * 1e-6).sort_values(ascending= False)[:10]
Out[13]:
PRODUCT_GROUP               13.175592
CUSTOMER_GROUP              13.091750
EINGEFUEGT_AM               12.368764
D19_LETZTER_KAUF_BRANCHE    11.792569
OST_WEST_KZ                 10.951514
CAMEO_DEU_2015               9.945912
CAMEO_DEUG_2015              9.709596
CAMEO_INTL_2015              8.973210
ALTERSKATEGORIE_GROB         1.533216
KBA05_ANHANG                 1.533216
dtype: float64
In [266]:
cols = gen_pop.columns
num_cols_gen = gen_pop._get_numeric_data().columns
non_numeric_col_gen_pop = list(set(cols) - set(num_cols_gen))
In [267]:
non_numeric_col_gen_pop
Out[267]:
['CAMEO_DEUG_2015',
 'EINGEFUEGT_AM',
 'OST_WEST_KZ',
 'D19_LETZTER_KAUF_BRANCHE',
 'CAMEO_INTL_2015',
 'CAMEO_DEU_2015']
In [268]:
cols = customers.columns
num_cols = customers._get_numeric_data().columns
non_numeric_col_customers = list(set(cols) - set(num_cols))
In [269]:
non_numeric_col_customers
Out[269]:
['PRODUCT_GROUP',
 'CAMEO_DEUG_2015',
 'EINGEFUEGT_AM',
 'CUSTOMER_GROUP',
 'OST_WEST_KZ',
 'D19_LETZTER_KAUF_BRANCHE',
 'CAMEO_INTL_2015',
 'CAMEO_DEU_2015']

Exploring the correlation between the different features

In [270]:
corr = gen_pop[num_cols_gen].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, cbar_kws={"shrink": .5})
Out[270]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a456a7ac8>

Starting with loading the trainning and testing data

In [272]:
# df_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
# df_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
df_train = pd.read_csv('df_train.csv')
df_test = pd.read_csv('df_test.csv')

Dropping the Unnamed: 0 column

In [339]:
if 'Unnamed: 0' in df_train.columns:
    df_train.drop(['Unnamed: 0'], axis=1, inplace= True)
if 'Unnamed: 0' in df_test.columns:
    df_test.drop(['Unnamed: 0'], axis=1, inplace= True)

Checking the shape of both datasets

In [273]:
print(df_train.shape)
print(df_test.shape)
(42962, 368)
(42833, 367)

Adjusting the df_train shape to match the df_test in order to simplify the plotting process

In [275]:
df_train = df_train.iloc[:42833,:]

The next two cells describe the dataset in terms of mean, min, max, etc. The differences will be highlighted in the upcoming plots.

In [24]:
df_train.describe()
Out[24]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ... VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP RESPONSE ANREDE_KZ ALTERSKATEGORIE_GROB
count 42833.000000 42833.000000 35880.000000 35880.000000 1986.000000 755.000000 174.000000 41.000000 34696.000000 35074.000000 ... 35583.000000 35583.000000 35583.000000 33181.000000 35880.000000 35074.000000 42833.000000 42833.000000 42833.000000 42833.000000
mean 42835.296454 0.542105 1.525334 10.290245 12.608258 13.785430 14.655172 14.195122 9.855949 6.695387 ... 4.317652 4.505269 3.116825 4.487297 8.730100 4.059845 2.804076 0.012397 1.595149 3.213504
std 24790.632421 1.412944 1.741935 6.084463 3.926351 3.067162 2.615329 3.034959 4.374407 15.109469 ... 3.164883 3.289605 2.534631 1.890125 1.010057 2.046597 1.121264 0.110651 0.490869 1.067599
min 1.000000 -1.000000 1.000000 0.000000 2.000000 5.000000 6.000000 6.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000
25% 21304.000000 -1.000000 1.000000 8.000000 9.000000 12.000000 13.000000 13.000000 8.000000 1.000000 ... 1.000000 2.000000 1.000000 3.000000 9.000000 3.000000 3.000000 0.000000 1.000000 3.000000
50% 42794.000000 1.000000 1.000000 10.000000 13.000000 14.000000 15.000000 15.000000 10.000000 2.000000 ... 3.000000 4.000000 2.000000 6.000000 9.000000 3.000000 3.000000 0.000000 2.000000 4.000000
75% 64366.000000 2.000000 1.000000 15.000000 16.000000 16.000000 17.000000 17.000000 13.000000 7.000000 ... 7.000000 7.000000 4.000000 6.000000 9.000000 7.000000 3.000000 0.000000 2.000000 4.000000
max 85795.000000 3.000000 9.000000 21.000000 18.000000 18.000000 18.000000 18.000000 25.000000 438.000000 ... 11.000000 13.000000 11.000000 6.000000 9.000000 8.000000 6.000000 1.000000 2.000000 9.000000

8 rows × 361 columns

In [43]:
df_test.describe()
Out[43]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ... VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB
count 42833.000000 42833.000000 35944.000000 35944.000000 2013.000000 762.000000 201.000000 39.000000 34715.000000 35206.000000 ... 34530.000000 35658.000000 35658.000000 35658.000000 33214.000000 35944.000000 35206.000000 42833.000000 42833.000000 42833.000000
mean 42993.165620 0.537436 1.518890 10.239511 12.534029 13.942257 14.442786 14.410256 9.822584 6.749986 ... 2.372401 4.308682 4.488474 3.090078 4.489282 8.727437 4.088280 2.800037 1.595475 3.220484
std 24755.599728 1.414777 1.737441 6.109680 3.996079 3.142155 2.787106 2.279404 4.410937 14.839779 ... 1.151531 3.169149 3.274829 2.510134 1.886903 1.013702 2.053977 1.120624 0.490806 1.069753
min 2.000000 -1.000000 1.000000 0.000000 2.000000 4.000000 6.000000 9.000000 0.000000 0.000000 ... 0.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000
25% 21650.000000 -1.000000 1.000000 8.000000 9.000000 12.000000 13.000000 13.000000 8.000000 1.000000 ... 2.000000 1.000000 2.000000 1.000000 3.000000 9.000000 3.000000 3.000000 1.000000 3.000000
50% 43054.000000 1.000000 1.000000 10.000000 13.000000 14.000000 15.000000 14.000000 10.000000 2.000000 ... 2.000000 3.000000 4.000000 2.000000 6.000000 9.000000 3.000000 3.000000 2.000000 4.000000
75% 64352.000000 2.000000 1.000000 15.000000 16.000000 17.000000 17.000000 16.000000 13.000000 7.000000 ... 3.000000 7.000000 7.000000 4.000000 6.000000 9.000000 7.000000 3.000000 2.000000 4.000000
max 85794.000000 3.000000 9.000000 21.000000 18.000000 18.000000 18.000000 18.000000 25.000000 379.000000 ... 4.000000 11.000000 13.000000 11.000000 6.000000 9.000000 8.000000 6.000000 2.000000 9.000000

8 rows × 360 columns

Below we plot the response distribution within the training data. We can see that the data is unbalanced with respect to the response

In [277]:
sns.countplot(df_train['RESPONSE'])
plt.savefig('skewed_response.png')
In [ ]:
 
In [60]:
print("There are {}% response values with 1".format(100 * df_train["RESPONSE"].value_counts()[1]/df_train.shape[0]))
There are 1.2396983634113885% response values with 1
In [289]:
# from https://www.kaggle.com/gpreda/santander-eda-and-prediction
def plot_feature_distribution(df1, df2, label1, label2, features, col=10):
    """ A function that ploter the feature distribution
    Inputs:
    df1, df2: the comparsion dataframes
    labels: labeling the two distributions
    features: the feature to be compared on
    
    """
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(10,10,figsize=(18,22))

    for feature in features:
        i += 1
        plt.subplot(col,10,i)
        try:
            sns.distplot(df1[feature], hist=False,label=label1)
        except:
            pass
        try:
            sns.distplot(df2[feature], hist=False,label=label2)
        except:
            pass
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.savefig('response_0_vs_1_in_train.png')
    plt.show();

Here we explore the features distribution in the training data between those with a response and those with none

In [290]:
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[:100]
plot_feature_distribution(t0, t1, '0', '1', features)
<Figure size 432x288 with 0 Axes>
In [90]:
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[100:200]
plot_feature_distribution(t0, t1, '0', '1', features)
<Figure size 432x288 with 0 Axes>
In [91]:
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[200:300]
plot_feature_distribution(t0, t1, '0', '1', features)
<Figure size 432x288 with 0 Axes>
In [108]:
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[300:360]
plot_feature_distribution(t0, t1, '0', '1', features, col=6)
<Figure size 432x288 with 0 Axes>

We already can see that for some features, both have the distribution whereas for some there is a clear difference. These will be closely monitored in the modeling section

Now we move to focus on the distribution between training and testing data. From the plots below, we can see that there is no difference in distribution between the training and testing data.

In [287]:
features = num_cols[:100]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
In [96]:
features = num_cols[100:200]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
<Figure size 432x288 with 0 Axes>
In [97]:
features = num_cols[200:300]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
<Figure size 432x288 with 0 Axes>

Preprocessing

First we check for duplicates

In [10]:
gen_pop.duplicated().sum()
Out[10]:
0

We will utilize the functions from below to replace the specific unkown values for each feature with NaNs to capture the true NaN count in the data.

In [283]:
def replaceMissingWithNaN(col, missing):
    """ A function that returns the number of missing values per column and a list representing the column with missing values converted to NaN
    input:
    col: a Series that contains NaN
    missing: values that are considered NaN for this feature
    Outpu:
    count: number of missing values
    lst: the col with NaN now correctly presented as NaN
    """
    missing.extend(["XX", "X"])
    count = col.isna().sum()
    lst = []
    for item in col:
        if item in missing:
            count += 1
            lst.append(np.NaN)
        else:
            lst.append(item)
    
    return count, lst

def findNans(df, missing):
    """ a function that utilizes replaceMissingWithNaN to provide a DF with all missing values converted to NaN 
    Input:
    df: dataframe to find NaN for
    missing: dataframe with each attribute missing values glossary
    output:
    count_nan: number of NaN per feature
    new_df: the df with now all NaN presented as NaN
    """
    
    headers = df.columns.values

    counts = []
    new_df = pd.DataFrame()
    for i in range(len(df.columns.values)):
        feat = df[headers[i]]
        try:
            miss = list(missing_values.loc[headers[i], :].values) #missing[i]
        except KeyError:
            miss = []
        count , new_df[headers[i]] = replaceMissingWithNaN(feat, miss)
        counts.append([headers[i], count])
    
    count_nan = pd.DataFrame(counts, columns= ['feature', 'NaN_count'])
    
    return count_nan, new_df

def findNansPrep(attributes_values):
    """ A function that prepares a df that contains each attribute with its nan code
    Input: attributes_values: a missing value glossary - dataframe
    Output: dataframe with each attribute missing values glossary
    """
    missing_values = attributes_values[~attributes_values.Attribute.isnull()]
    missing_values.drop(columns=[missing_values.columns[0],'Description'], axis=1,inplace= True)
    missing_values.Meaning.replace({'unknown / no main age detectable': 'unknown'}, inplace= True)
    missing_values = missing_values[missing_values.Meaning == 'unknown']
    missing_values.drop(['Meaning'], axis=1, inplace= True)
    missing_values.set_index('Attribute', inplace= True)
    return missing_values
    
    

We will load the excel file that specify which value does correspond to NaN for each feature

In [284]:
# read xsl file and get values for NaN for each feature
attributes_values = pd.read_excel('DIAS Attributes - Values 2017.xlsx', header= 1)
In [285]:
missing_values = findNansPrep(attributes_values)
//anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3940: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
//anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:6586: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)

Missing data per feature using the functions shown earlier

In [286]:
nan_count, gen_pop_with_nan = findNans(gen_pop, missing_values)

Here we calculate NaN as a percentage for each feature

In [305]:
nan_count['NaN_perc'] = nan_count.NaN_count.apply(lambda x: x/gen_pop_with_nan.shape[0])

We set a limit beyond which the feature is a candidate to be dropped

In [309]:
fig, ax = plt.subplots(figsize=[10, 5])
plt.hist(nan_count['NaN_perc'], bins=20);
plt.xlabel('percentage of missing data', fontsize=18)
plt.ylabel('frequency', fontsize=16)
fig.savefig('missing_data_feat_hist.png')
In [310]:
feature_NaN_limit = 0.4
In [311]:
nan_count[nan_count.NaN_perc> feature_NaN_limit]
Out[311]:
feature NaN_count NaN_perc
2 AGER_TYP 677503 0.760196
5 ALTER_KIND1 810163 0.909048
6 ALTER_KIND2 861722 0.966900
7 ALTER_KIND3 885051 0.993077
8 ALTER_KIND4 890016 0.998648
101 EXTSEL992 654153 0.733996
301 KK_KUNDENTYP 584612 0.655967

AGER_TYP: High correlation with FINANZ_SPARER, but linear regression shows bad fit, DROP
ALTER_HH: No good correlation but relatively lowr NaN percent, KEEP
ALTER_KIND: Good correlation with similar features that have a lot of data missing, DROP
EXTSEL992: Decent correlation with VK_DISTANZ, but linear regression shows bad fit, DROP
KK_KUNDENTYP: Good correlation with D19_GESAMT_DATUM, KEEP

In [15]:
feats = nan_count[nan_count.NaN_perc> feature_NaN_limit].feature
for feat in feats:
    print("\n", feat, "\n", corr.loc[feat, :].nlargest())
    print("\n", corr.loc[feat, :].nsmallest())
 AGER_TYP 
 AGER_TYP            1.000000
FINANZ_VORSORGER    0.581710
CJT_TYP_3           0.394965
CJT_TYP_5           0.384827
CJT_TYP_4           0.375220
Name: AGER_TYP, dtype: float64

 FINANZ_SPARER           -0.611760
FINANZ_ANLEGER          -0.536300
CJT_TYP_1               -0.508349
CJT_TYP_2               -0.505620
FINANZ_UNAUFFAELLIGER   -0.491088
Name: AGER_TYP, dtype: float64

 ALTER_KIND1 
 ALTER_KIND1       1.000000
ALTER_KIND2       0.781195
ALTER_KIND3       0.610869
ALTER_KIND4       0.477439
WOHNDAUER_2008    0.273142
Name: ALTER_KIND1, dtype: float64

 EINGEZOGENAM_HH_JAHR    -0.305786
ANZ_KINDER              -0.259130
CJT_TYP_1               -0.213948
CJT_TYP_2               -0.213663
FINANZ_UNAUFFAELLIGER   -0.209062
Name: ALTER_KIND1, dtype: float64

 ALTER_KIND2 
 ALTER_KIND2         1.000000
ALTER_KIND1         0.781195
ALTER_KIND3         0.774012
ALTER_KIND4         0.645995
FINANZ_VORSORGER    0.216992
Name: ALTER_KIND2, dtype: float64

 EINGEZOGENAM_HH_JAHR    -0.234734
ANZ_KINDER              -0.195075
FINANZ_UNAUFFAELLIGER   -0.178762
CJT_TYP_1               -0.161571
PRAEGENDE_JUGENDJAHRE   -0.154922
Name: ALTER_KIND2, dtype: float64

 ALTER_KIND3 
 ALTER_KIND3         1.000000
ALTER_KIND2         0.774012
ALTER_KIND4         0.764780
ALTER_KIND1         0.610869
FINANZ_VORSORGER    0.137743
Name: ALTER_KIND3, dtype: float64

 ANZ_KINDER              -0.207538
EINGEZOGENAM_HH_JAHR    -0.142383
CJT_TYP_1               -0.117233
FINANZ_UNAUFFAELLIGER   -0.117207
PRAEGENDE_JUGENDJAHRE   -0.103907
Name: ALTER_KIND3, dtype: float64

 ALTER_KIND4 
 ALTER_KIND4    1.000000
ALTER_KIND3    0.764780
ALTER_KIND2    0.645995
ALTER_KIND1    0.477439
ARBEIT         0.112273
Name: ALTER_KIND4, dtype: float64

 ANZ_KINDER            -0.233362
UNGLEICHENN_FLAG      -0.190781
BALLRAUM              -0.128128
INNENSTADT            -0.111323
KBA13_HERST_AUDI_VW   -0.108736
Name: ALTER_KIND4, dtype: float64

 EXTSEL992 
 EXTSEL992             1.000000
VHA                   0.314888
D19_VERSAND_ANZ_24    0.217907
D19_GESAMT_ANZ_24     0.209537
D19_VERSAND_ANZ_12    0.203654
Name: EXTSEL992, dtype: float64

 VK_DISTANZ                  -0.531864
VK_DHT4A                    -0.478065
VK_ZG11                     -0.448952
D19_VERSAND_OFFLINE_DATUM   -0.309373
D19_GESAMT_OFFLINE_DATUM    -0.276839
Name: EXTSEL992, dtype: float64

 KK_KUNDENTYP 
 KK_KUNDENTYP                1.000000
D19_GESAMT_DATUM            0.855140
D19_GESAMT_ONLINE_DATUM     0.551831
D19_VERSAND_DATUM           0.543987
D19_VERSAND_ONLINE_DATUM    0.513553
Name: KK_KUNDENTYP, dtype: float64

 D19_GESAMT_ANZ_12             -0.621466
D19_VERSAND_ANZ_12            -0.546991
D19_GESAMT_ANZ_24             -0.540012
D19_VERSAND_ANZ_24            -0.491816
D19_VERSAND_ONLINE_QUOTE_12   -0.242909
Name: KK_KUNDENTYP, dtype: float64

The cell below shows that all the ALTER_KIND data can not be used to imputate the missing values between them

In [16]:
gen_pop_with_nan[~gen_pop_with_nan.ALTER_KIND1.isnull() | ~gen_pop_with_nan.ALTER_KIND2.isnull() | ~gen_pop_with_nan.ALTER_KIND3.isnull() | ~gen_pop_with_nan.ALTER_KIND4.isnull()].shape[0]/gen_pop_with_nan.shape[0]
Out[16]:
0.09095162703751371
In [312]:
def linearRegFillNa(df, feature, util_feature):
    """ A function that utilize linear regression to fill NaN 
    Input:
    df: dataframe to be utilized
    feature: for which linear regression is applied
    util_feature: the feature utilized to extrapolate the wanted feature
    output: 
    predicted_feature: A series that represent the feature after extrapolation
    """
    shared_data_util_feature = df[~df[util_feature].isnull() & ~df[feature].isnull()][util_feature]
    shared_data_util_feature = shared_data_util_feature.values.reshape(-1, 1)
    
    shared_data_feature = df[~df[util_feature].isnull() & ~df[feature].isnull()][feature]
    shared_data_feature = shared_data_feature.values.reshape(-1, 1)
    
    reg = LinearRegression().fit(shared_data_util_feature, shared_data_feature)
    
    available_util_feature = df[~df[util_feature].isnull()][util_feature]
    available_util_feature = available_util_feature.values.reshape(-1, 1)
    
    predicted_feature = reg.predict(available_util_feature)
    predicted_feature = np.round(predicted_feature)
    
    return predicted_feature
    
    
def regFillApply(df, feature, util_feature):
    """ A function that utilize linearRegFillNa to add a new extrapolated feature to df 
    Input:
    df: dataframe to be utilized
    feature: for which linear regression is applied
    util_feature: the feature utilized to extrapolate the wanted feature
    output: 
    None - it changes the df 
    """
    df[feature + "_corr"] = np.nan
    df.loc[~df[util_feature].isnull(), feature + "_corr"] = linearRegFillNa(df, feature, util_feature)
    print("r2=",r2_score(df[~df[util_feature].isnull() & ~df[feature].isnull()][feature], df[~df[util_feature].isnull() & ~df[feature].isnull()][feature + "_corr"]))
    print("MSE=",mean_squared_error(df[~df[util_feature].isnull() & ~df[feature].isnull()][feature], df[~df[util_feature].isnull() & ~df[feature].isnull()][feature + "_corr"]))
    df[feature + "_mod"] = df[feature]
    df[feature + "_mod"].fillna(df[feature + "_corr"], inplace= True)
In [313]:
regFillApply(gen_pop_with_nan, 'AGER_TYP', 'FINANZ_SPARER')
r2= -0.1912770229268712
MSE= 0.6563134597928111
In [314]:
regFillApply(gen_pop_with_nan, 'EXTSEL992', 'VK_DISTANZ')
r2= 0.2794630388402656
MSE= 152.26013553528824
In [315]:
regFillApply(gen_pop_with_nan, 'KK_KUNDENTYP', 'D19_GESAMT_DATUM')
r2= 0.7720671189774412
MSE= 0.6047343685279949
In [21]:
gen_pop_with_nan.columns[-6:].values
Out[21]:
array(['AGER_TYP_corr', 'AGER_TYP_mod', 'EXTSEL992_corr', 'EXTSEL992_mod',
       'KK_KUNDENTYP_corr', 'KK_KUNDENTYP_mod'], dtype=object)
In [22]:
f, ax = plt.subplots(figsize=(11, 9))
plt.xlim(0, 7)
plt.ylim(0, 7)
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
gen_pop_with_nan.plot.hexbin(x='KK_KUNDENTYP_corr', y='KK_KUNDENTYP', gridsize=10, ax=ax)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27813198>
In [23]:
f, ax = plt.subplots(figsize=(11, 9))
plt.xlim(0, 7)
plt.ylim(0, 7)
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
gen_pop_with_nan.plot.hexbin(x='KK_KUNDENTYP_mod', y='KK_KUNDENTYP', gridsize=10, ax=ax)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25e0cbe0>
In [316]:
gen_pop_with_nan.drop(['AGER_TYP', 'AGER_TYP_corr','AGER_TYP_mod', 'EXTSEL992', 'EXTSEL992_corr', 'EXTSEL992_mod', 'KK_KUNDENTYP', 'KK_KUNDENTYP_corr'], axis=1, inplace=True)

Missing dat per row

In [317]:
# percentage of NaNs per row
gen_pop_with_nan['row_percent_nan']= gen_pop_with_nan.isnull().mean(axis=1) 
In [318]:
plt.hist(gen_pop_with_nan['row_percent_nan'])
Out[318]:
(array([7.48851e+05, 2.48000e+03, 2.69860e+04, 7.08800e+03, 2.64200e+03,
        3.93300e+03, 7.00000e+00, 5.81700e+03, 1.51810e+04, 7.82360e+04]),
 array([0.        , 0.07068493, 0.14136986, 0.21205479, 0.28273973,
        0.35342466, 0.42410959, 0.49479452, 0.56547945, 0.63616438,
        0.70684932]),
 <a list of 10 Patch objects>)
In [319]:
row_NaN_limit = 0.3
In [320]:
gen_pop_low_nan = gen_pop_with_nan[gen_pop_with_nan['row_percent_nan'] < row_NaN_limit]
In [322]:
gen_pop_with_nan.shape[0]-gen_pop_low_nan.shape[0]
Out[322]:
105801
In [30]:
gen_pop_low_nan.drop(['row_percent_nan'], axis=1, inplace= True)
//anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3940: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
In [31]:
gen_pop_low_nan[non_numeric_col_gen_pop].head()
Out[31]:
OST_WEST_KZ EINGEFUEGT_AM D19_LETZTER_KAUF_BRANCHE CAMEO_INTL_2015 CAMEO_DEU_2015 CAMEO_DEUG_2015
1 W 1992-02-10 00:00:00 NaN 51 8A 8
2 W 1992-02-12 00:00:00 D19_UNBEKANNT 24 4C 4
3 W 1997-04-21 00:00:00 D19_UNBEKANNT 12 2A 2
4 W 1992-02-12 00:00:00 D19_SCHUHE 43 6B 6
5 W 1992-02-12 00:00:00 D19_ENERGIE 54 8C 8
In [32]:
gen_pop_low_nan['OST_WEST_KZ'].value_counts()
Out[32]:
W    619286
O    166134
Name: OST_WEST_KZ, dtype: int64
In [33]:
gen_pop_low_nan.OST_WEST_KZ = gen_pop_low_nan.OST_WEST_KZ.replace({"W": 0, "O": 1}) 
//anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:5096: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
In [34]:
gen_pop_low_nan['EINGEFUEGT_AM'] = pd.to_datetime(gen_pop_low_nan['EINGEFUEGT_AM'])
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [35]:
gen_pop_low_nan['EINGEFUEGT_AM_weekday'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.weekday())
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [36]:
gen_pop_low_nan['EINGEFUEGT_AM_day'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.day)
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [37]:
gen_pop_low_nan['EINGEFUEGT_AM_month'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.month)
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [38]:
gen_pop_low_nan['EINGEFUEGT_AM_year'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.year)
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [39]:
gen_pop_low_nan.drop(['EINGEFUEGT_AM'], axis= 1, inplace= True)
In [40]:
gen_pop_low_nan['CAMEO_INTL_2015_adj'] = gen_pop_low_nan['CAMEO_INTL_2015'].apply(lambda x: int(x) if x == x else np.nan)
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [41]:
gen_pop_low_nan['CAMEO_INTL_2015_adj'].value_counts().shape
Out[41]:
(21,)
In [42]:
gen_pop_low_nan.drop(['CAMEO_INTL_2015'], axis=1, inplace= True)
In [8]:
def getWealth(col):
    lst = []
    for item in col:
        x = float(item)
        if x > 50:
            lst.append(0)
        elif x > 35:
            lst.append(1)
        elif x > 25:
            lst.append(2)
        elif x > 15:
            lst.append(3)
        elif x > -1:
            lst.append(4)
        else:
            lst.append(np.nan)
    return lst

def getLifeStage(col):
    lst = []
    for item in col:
        if str(item)[1]== '1':
            lst.append(0)
        elif str(item)[1]== '2':
            lst.append(1)
        elif str(item)[1]== '3':
            lst.append(2)
        elif str(item)[1]== '4':
            lst.append(3)
        elif str(item)[1]== '5':
            lst.append(4)
        else:
            lst.append(np.nan)
    return lst
In [44]:
gen_pop_low_nan["wealth"] = getWealth(gen_pop_low_nan["CAMEO_INTL_2015_adj"])

gen_pop_low_nan["life_stage"] = getLifeStage(gen_pop_low_nan['CAMEO_INTL_2015_adj'])
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [45]:
gen_pop_low_nan.drop(['CAMEO_INTL_2015_adj'], axis= 1, inplace= True)
In [46]:
gen_pop_low_nan['CAMEO_DEUG_2015_adj'] = gen_pop_low_nan['CAMEO_DEUG_2015'].apply(lambda x: int(x) if x == x else np.nan)
//anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [47]:
gen_pop_low_nan['CAMEO_DEUG_2015_adj'].value_counts()
Out[47]:
8.0    132942
9.0    106296
6.0    104762
4.0    102620
3.0     85436
2.0     82326
7.0     76817
5.0     54353
1.0     35870
Name: CAMEO_DEUG_2015_adj, dtype: int64
In [48]:
gen_pop_low_nan.drop(['CAMEO_DEUG_2015'], axis= 1, inplace= True)
In [49]:
gen_pop_low_nan['D19_LETZTER_KAUF_BRANCHE'].value_counts()
Out[49]:
D19_UNBEKANNT             188143
D19_VERSICHERUNGEN         55659
D19_SONSTIGE               43050
D19_VOLLSORTIMENT          33465
D19_SCHUHE                 31208
D19_BUCH_CD                27738
D19_VERSAND_REST           25062
D19_DROGERIEARTIKEL        23137
D19_BANKEN_DIREKT          22398
D19_BEKLEIDUNG_REST        20890
D19_HAUS_DEKO              20098
D19_TELKO_MOBILE           13904
D19_ENERGIE                11677
D19_TELKO_REST             11064
D19_BANKEN_GROSS           10140
D19_BEKLEIDUNG_GEH          9862
D19_KINDERARTIKEL           6978
D19_FREIZEIT                6966
D19_TECHNIK                 6731
D19_LEBENSMITTEL            6214
D19_BANKEN_REST             5042
D19_RATGEBER                4744
D19_NAHRUNGSERGAENZUNG      3933
D19_DIGIT_SERV              3453
D19_REISEN                  2999
D19_TIERARTIKEL             2480
D19_SAMMELARTIKEL           2349
D19_HANDWERK                2151
D19_WEIN_FEINKOST           2088
D19_GARTEN                  1578
D19_BANKEN_LOKAL            1391
D19_BIO_OEKO                1175
D19_BILDUNG                  937
D19_LOTTO                    799
D19_KOSMETIK                 761
Name: D19_LETZTER_KAUF_BRANCHE, dtype: int64
In [50]:
gen_pop_low_nan['CAMEO_DEU_2015'].value_counts()
Out[50]:
6B    56079
8A    51918
4C    47265
2D    34718
3C    34379
7A    34003
3D    33972
8B    32849
4A    32657
8C    30754
9D    28090
9B    27288
9C    24577
7B    24134
9A    20394
2C    19192
8D    17421
6E    15970
2B    15302
5D    14692
6C    14648
2A    13114
5A    11878
1D    11845
1A    10722
5B    10242
3A    10093
5C     9780
4B     8938
7C     8900
4D     8493
3B     6992
6A     6709
6D     6018
9E     5947
6F     5338
4E     5267
7D     5199
1E     5009
7E     4581
1C     4266
5F     4237
1B     4028
5E     3524
Name: CAMEO_DEU_2015, dtype: int64
In [51]:
gen_pop_low_nan.shape
Out[51]:
(785420, 368)
In [52]:
gc.collect()
Out[52]:
80
In [53]:
gen_pop_low_nan_encoded = pd.get_dummies(gen_pop_low_nan, columns = ['D19_LETZTER_KAUF_BRANCHE','CAMEO_DEU_2015'])
In [3]:
# gen_pop_low_nan_encoded.to_csv('gen_pop_low_nan_encoded.csv')
# gen_pop_low_nan_encoded = pd.read_csv('gen_pop_low_nan_encoded.csv')

More mixed features engineering

wealth: 0 poor: 1,2,5,6,14,15,21,24,29,31 1 mid: 3,4,7,8,16,22,25,30,32 2 rich: 10,13,18,20,23,28,35,39,40

status: 0 single:1-13 and 21-23 1 couple:14-20 2 family: 24-40

age: 0: 1,3,14,18,29,30,33,34,35 1: 2,4,39 2: 5,7,11,37 3: 6, 8,12, 13, 15,16,19,20, 31,32,36,38,40

In [9]:
def wealth_soc_stat_age(col):
    """ A function that calculates wealth, social status and age features from LP_LEBENSPHASE_FEIN
    Input:
    col: the original LP_LEBENSPHASE_FEIN col
    Output:
    wealth: wealth feature - Series
    stat: social status feature - Series
    age: age feature - Series
    """
    wealth = []
    stat = []
    age = []
    
    for item in col:
        if item in [1,2,5,6,14,15,21,24,29,31]:
            wealth.append(0)
        elif item in [3,4,7,8,16,22,25,30,32]:
            wealth.append(1)
        elif item in [10,13,18,20,23,28,35,39,40]:
            wealth.append(2)
        else:
            wealth.append(np.nan)
            
    for item in col:
        if item<=13 or 21<=item<=23:
            stat.append(0)
        elif 14<=item<=20:
            stat.append(1)
        elif item>=24:
            stat.append(2)
        else:
            stat.append(np.nan)
            
    for item in col:
        if item in [1,3,14,18,29,30,33,34,35]:
            age.append(0)
        elif item in [2,4,39]:
            age.append(1)
        elif item in [5,7,11,37]:
            age.append(2)
        elif item in [6, 8,12, 13, 15,16,19,20, 31,32,36,38,40]:
            age.append(3)
        else:
            age.append(np.nan)
            
    return wealth, stat, age
    
    
In [10]:
def wealth_soc_stat_LP_LEBENSPHASE_GROB(col):
    """ A function that calculates wealth, social status features from LP_LEBENSPHASE_GROB
    Input:
    col: the original LP_LEBENSPHASE_GROB col
    Output:
    wealth: wealth feature - Series
    stat: social status feature - Series
    """
    wealth = []
    stat = []
    
    for item in col:
        if item in [1,2,4,7,10]:
            wealth.append(0)
        elif item in [9]:
            wealth.append(1)
        elif item in [3,5,8,11,12]:
            wealth.append(2)
        else:
            wealth.append(np.nan)
            
    for item in col:
        if item <=7:
            stat.append(0)
        elif item > 7:
            stat.append(1)
        else:
            stat.append(np.nan)
        
    return wealth, stat
In [56]:
gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_wealth'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_stat'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_age'] = wealth_soc_stat_age(gen_pop_low_nan_encoded.LP_LEBENSPHASE_FEIN)
In [57]:
gen_pop_low_nan_encoded['LP_LEBENSPHASE_GROB_wealth'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_GROB_stat'] = wealth_soc_stat_LP_LEBENSPHASE_GROB(gen_pop_low_nan_encoded.LP_LEBENSPHASE_GROB)
In [11]:
def getDecade(col):
    """ a function that calculated the decade of a person's youth movement from PRAEGENDE_JUGENDJAHRE
    Input:
    col: PRAEGENDE_JUGENDJAHRE
    output: 
    lst: the decade of a person's youth movement - list
    """
    lst = []
    for item in col:
        if item==1 or item==2:
            lst.append(1)
        elif item==3 or item==4:
            lst.append(2)
        elif item>4 and item<8:
            lst.append(3)
        elif item==8 or item==9:
            lst.append(4)
        elif item>9 and item<14:
            lst.append(5)
        elif item==14 or item==15:
            lst.append(6)
        else:
            lst.append(np.nan)
    
    return lst

def getMovement(col):
    """ a function that calculates the movement of interest during a person's youth from  PRAEGENDE_JUGENDJAHRE
    input:
    col: PRAEGENDE_JUGENDJAHRE
    output:
    lst: the movement of interest during a person's youth - list
    """
    lst = []
    for item in col:
        if item in [1, 3, 5, 8, 10, 12, 14]:
            lst.append(0)
        elif item in [2, 4, 6, 7, 9, 11, 13, 15]:
            lst.append(1)
        else:
            lst.append(np.nan)
    
    return lst
In [59]:
gen_pop_low_nan_encoded['Movement'] = getMovement(gen_pop_low_nan_encoded['PRAEGENDE_JUGENDJAHRE'])
gen_pop_low_nan_encoded['Decade']= getDecade(gen_pop_low_nan_encoded['PRAEGENDE_JUGENDJAHRE'])
In [60]:
gen_pop_low_nan_encoded['WOHNLAGE_adj'] = gen_pop_low_nan_encoded.WOHNLAGE.apply(lambda x: x if x < 7 else np.nan)
In [61]:
gen_pop_low_nan_encoded.drop(['WOHNLAGE'], axis= 1, inplace= True)
In [62]:
gen_pop_low_nan_encoded['PLZ8_BAUMAX_adj'] = gen_pop_low_nan_encoded.PLZ8_BAUMAX.apply(lambda x: x if x < 5 else np.nan)
In [63]:
gen_pop_low_nan_encoded.drop(['PLZ8_BAUMAX'], axis= 1, inplace= True)
In [65]:
gen_pop_low_nan_encoded.drop(["ALTER_KIND1","ALTER_KIND2","ALTER_KIND3","ALTER_KIND4"], axis= 1, inplace= True)
In [66]:
gen_pop_low_nan_encoded.shape
Out[66]:
(785420, 448)
In [67]:
gc.collect()
Out[67]:
74
In [73]:
# gen_pop_low_nan_encoded.to_csv('gen_pop_low_nan_encoded_mixed_resolved_pending_eng.csv')
In [71]:
del gen_pop, gen_pop_low_nan, gen_pop_with_nan
In [72]:
gc.collect()
Out[72]:
7707
In [73]:
idx = gen_pop_low_nan_encoded.columns.values[1:]
for df in [gen_pop_low_nan_encoded]:
    df['sum'] = df[idx].sum(axis=1)  
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)
In [327]:
# gen_pop_low_nan_encoded.to_pickle('gen_pop_low_nan_encoded_few_eng_feat.pkl')
gen_pop_low_nan_encoded = pd.read_pickle('gen_pop_low_nan_encoded_few_eng_feat.pkl')
In [328]:
gen_pop_low_nan_encoded.drop(columns=['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'PRAEGENDE_JUGENDJAHRE'], axis= 1, inplace= True)
In [324]:
gen_pop_low_nan_encoded.head()
Out[324]:
LNR AKT_DAT_KL ALTER_HH ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL ANZ_KINDER ANZ_PERSONEN ANZ_STATISTISCHE_HAUSHALTE ANZ_TITEL ... WOHNLAGE_adj PLZ8_BAUMAX_adj sum min max mean std skew kurt med
1 910220 9.0 NaN 21.0 11.0 0.0 0.0 2.0 12.0 0.0 ... 4.0 1.0 10009.0 0.0 2004.0 22.956422 195.586616 9.738685 94.776432 2.0
2 910225 9.0 17.0 17.0 10.0 0.0 0.0 1.0 7.0 0.0 ... 2.0 1.0 9757.0 0.0 2000.0 21.827740 190.308528 10.083373 101.529590 2.0
3 910226 1.0 13.0 13.0 1.0 0.0 0.0 0.0 2.0 0.0 ... NaN 1.0 9607.0 0.0 1998.0 21.784580 190.400287 10.109462 101.864963 2.0
4 910241 1.0 20.0 14.0 3.0 0.0 0.0 4.0 3.0 0.0 ... 3.0 2.0 9528.0 0.0 2004.0 21.315436 188.208420 10.299981 105.396267 2.0
5 910244 1.0 10.0 10.0 5.0 0.0 0.0 1.0 2.0 0.0 ... NaN 1.0 10255.0 0.0 1994.0 22.993274 196.303050 9.573262 91.164293 2.0

5 rows × 453 columns

Now we start the imputation, scaling process and PCA to start the segmentation process

In [329]:
imp = SimpleImputer(missing_values= np.nan, strategy='mean')
gen_pop_no_nan = imp.fit_transform(gen_pop_low_nan_encoded)
scaler = RobustScaler()
gen_pop_no_nan_scaled = scaler.fit_transform(gen_pop_no_nan)
In [330]:
pca = PCA()
X_pca = pca.fit_transform(gen_pop_no_nan_scaled)

Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

In [331]:
n = len(pca.explained_variance_ratio_)
vr = pca.explained_variance_ratio_
index = np.arange(1, n + 1)
cum_vr = np.cumsum(vr)
    
plt.figure()
ax = plt.subplot()
ax.plot(index, cum_vr)
    
ax.set_xlabel("Principal Component")
ax.set_ylabel("Explained Variance")
plt.savefig('pca_vr.png')
In [332]:
# Re-apply PCA to the data while selecting for number of components to retain.
pca = PCA(n_components=100, random_state=42)
X_pca = pca.fit_transform(gen_pop_no_nan_scaled)
In [333]:
pca.explained_variance_ratio_.sum()
Out[333]:
0.9021340710151671
In [334]:
def getWeight(pca, component, df, abs):
    """ A function that calculates the weights of features for each component of pca
    input:
    pca: pca of interest
    component: the number for the component we're interested in investigating - int
    df: the dataframe utilized
    abs: a flag to get the absolute value or not - bool
    
    """
    weights = {}
    index_for_weights = df.columns
    for i in range(pca.components_.shape[1]):
        if abs:
            weights[index_for_weights[i]] = abs(pca.components_[component-1][i])
        else:
            weights[index_for_weights[i]] = pca.components_[component-1][i]

    sort_weights = sorted(weights.items(), key= operator.itemgetter(1))
    return sort_weights
In [335]:
getWeight(pca, 1, gen_pop_low_nan_encoded, False)
Out[335]:
[('D19_HAUS_DEKO', -0.01204506005626526),
 ('D19_BEKLEIDUNG_REST', -0.011891248251020819),
 ('Movement', -0.011179026544513056),
 ('D19_KOSMETIK', -0.01104645772364246),
 ('D19_REISEN', -0.010847353880778987),
 ('D19_KINDERARTIKEL', -0.010467096718028583),
 ('KBA05_GBZ', -0.0104492471407466),
 ('D19_TELKO_MOBILE', -0.010352336613163985),
 ('MOBI_REGIO', -0.010213617669301518),
 ('D19_HANDWERK', -0.010038447468377438),
 ('D19_BANKEN_DIREKT', -0.00928839882910659),
 ('ANZ_PERSONEN', -0.008879275049852624),
 ('D19_VERSAND_REST', -0.008740906796710458),
 ('D19_TELKO_REST', -0.007903415872586654),
 ('D19_DROGERIEARTIKEL', -0.007689170910953993),
 ('D19_SAMMELARTIKEL', -0.007320965851109183),
 ('D19_VERSAND_ANZ_12', -0.007219100444147071),
 ('D19_FREIZEIT', -0.007165606878100319),
 ('D19_GESAMT_ANZ_24', -0.006934704513188864),
 ('KBA13_AUTOQUOTE', -0.006763741384525784),
 ('MOBI_RASTER', -0.006630832878307195),
 ('D19_BEKLEIDUNG_GEH', -0.006262964450059099),
 ('GEBAEUDETYP_RASTER', -0.006244774778615672),
 ('D19_RATGEBER', -0.006195566972831489),
 ('PLZ8_ANTG1', -0.0061459845688914515),
 ('D19_BILDUNG', -0.005871124945466551),
 ('FIRMENDICHTE', -0.005858780886487665),
 ('LP_LEBENSPHASE_FEIN_wealth', -0.00567198544923464),
 ('D19_BANKEN_ONLINE_QUOTE_12', -0.005592367804737787),
 ('ONLINE_AFFINITAET', -0.005578862748961736),
 ('D19_SCHUHE', -0.0055441834810452395),
 ('D19_VERSAND_ANZ_24', -0.0054600752404421775),
 ('D19_BANKEN_GROSS', -0.005430407780616217),
 ('KBA13_HALTER_55', -0.0054192672222157295),
 ('KBA13_GBZ', -0.005319783662688892),
 ('KBA05_ANTG1', -0.00530624711785668),
 ('PLZ8_GBZ', -0.005172161749675779),
 ('D19_WEIN_FEINKOST', -0.005022999196886828),
 ('D19_BANKEN_REST', -0.004947274426369961),
 ('KBA13_ALTERHALTER_60', -0.0048795870097076),
 ('LP_STATUS_GROB', -0.004863597563806525),
 ('D19_GESAMT_ANZ_12', -0.0047437378541780074),
 ('SEMIO_DOM', -0.004703178459073464),
 ('KBA05_ANTG2', -0.004653315191929169),
 ('LP_FAMILIE_GROB', -0.004571237674394199),
 ('LP_STATUS_FEIN', -0.004438448618665328),
 ('D19_VERSICHERUNGEN', -0.004437078303091387),
 ('wealth', -0.004427301887850976),
 ('VHA', -0.004395598673693874),
 ('KBA13_KW_110', -0.004358426765969317),
 ('D19_LEBENSMITTEL', -0.004295993418450931),
 ('FINANZ_MINIMALIST', -0.004288967130815223),
 ('D19_SOZIALES', -0.0042392108894427506),
 ('LP_FAMILIE_FEIN', -0.004218242350982056),
 ('LP_LEBENSPHASE_GROB_wealth', -0.004202464140209497),
 ('ALTER_HH', -0.00409345902925474),
 ('KBA13_KMH_211', -0.00401783047479815),
 ('KBA13_KMH_250', -0.003996223159476403),
 ('D19_ENERGIE', -0.003952712314925474),
 ('KBA13_HALTER_50', -0.003938661840193228),
 ('D19_BIO_OEKO', -0.003836150081756482),
 ('KBA13_HALTER_20', -0.0035898239788393705),
 ('LP_LEBENSPHASE_FEIN_stat', -0.003541228655565868),
 ('KONSUMNAEHE', -0.003440913788308449),
 ('D19_NAHRUNGSERGAENZUNG', -0.0033730051474015216),
 ('D19_GARTEN', -0.0033093235401989345),
 ('GREEN_AVANTGARDE', -0.003246778181539142),
 ('INNENSTADT', -0.0032071979949729163),
 ('KBA13_CCM_2000', -0.003202012278693277),
 ('UMFELD_ALT', -0.0031973504757256905),
 ('RT_UEBERGROESSE', -0.0031847950000432118),
 ('std', -0.0030827963252056728),
 ('D19_VOLLSORTIMENT', -0.003075331028083879),
 ('GEBURTSJAHR', -0.003023295044065905),
 ('D19_BUCH_CD', -0.0030223900227119324),
 ('D19_TIERARTIKEL', -0.0029856833384345983),
 ('KBA13_ANTG1', -0.0028938468322806405),
 ('D19_LOTTO', -0.002866958648690529),
 ('D19_TECHNIK', -0.0028491846820044224),
 ('LP_LEBENSPHASE_GROB_stat', -0.0028406169290432046),
 ('D19_SONSTIGE', -0.0028014946460921115),
 ('KBA13_SEG_GELAENDEWAGEN', -0.002725383145831273),
 ('KBA13_KRSAQUOT', -0.002703733362308084),
 ('D19_BANKEN_ANZ_24', -0.0027036619394906476),
 ('D19_VERSAND_ONLINE_QUOTE_12', -0.002571908339393499),
 ('D19_VERSI_ANZ_24', -0.0025693772641124293),
 ('ALTERSKATEGORIE_FEIN', -0.00253552069910996),
 ('GEMEINDETYP', -0.002474721638818286),
 ('KBA13_SEG_VAN', -0.002449853952244353),
 ('KBA13_KRSSEG_VAN', -0.002409694592616963),
 ('SEMIO_KULT', -0.0024073326022747),
 ('SEMIO_MAT', -0.0024042570492839818),
 ('FINANZ_ANLEGER', -0.002378289390474196),
 ('KBA13_HALTER_65', -0.0023756643215838507),
 ('KBA13_SEG_GROSSRAUMVANS', -0.002297542727164334),
 ('HEALTH_TYP', -0.002282232912302148),
 ('KBA13_ANZAHL_PKW', -0.002211824935864895),
 ('WOHNDAUER_2008', -0.0021603524656440445),
 ('KBA13_SEG_MINIVANS', -0.002117740910665856),
 ('KBA13_MERCEDES', -0.002078859047101267),
 ('KBA13_HALTER_60', -0.0020699118313816686),
 ('SEMIO_SOZ', -0.0020493455763971762),
 ('D19_DIGIT_SERV', -0.0020110765157039744),
 ('KBA13_BJ_2008', -0.00198956544402925),
 ('KBA13_SEG_MITTELKLASSE', -0.0019480185083485336),
 ('ANZ_KINDER', -0.0019469036681866133),
 ('KBA13_VORB_0', -0.0018033744374663695),
 ('KBA13_CCM_2501', -0.0017893382897765247),
 ('KBA13_CCM_2500', -0.001782554839301307),
 ('KBA13_CCM_1401_2500', -0.001769208351024004),
 ('KBA13_CCM_3000', -0.0017683941288398075),
 ('KBA13_HERST_BMW_BENZ', -0.0017169862101832176),
 ('SEMIO_RAT', -0.0015967985178888928),
 ('D19_BANKEN_ANZ_12', -0.0015949151067244029),
 ('KBA13_AUDI', -0.0015731967179489867),
 ('KBA13_SEG_OBEREMITTELKLASSE', -0.00156072785428265),
 ('D19_GESAMT_ONLINE_QUOTE_12', -0.0015605881055259507),
 ('KBA13_SITZE_6', -0.001552705307879793),
 ('FINANZ_SPARER', -0.001532272476624564),
 ('BALLRAUM', -0.0015210995901717675),
 ('FINANZTYP', -0.0015010129374675423),
 ('D19_VERSI_ANZ_12', -0.0014781454449466897),
 ('KBA13_KRSHERST_BMW_BENZ', -0.0014491646856379522),
 ('D19_BANKEN_LOKAL', -0.0013878357215695754),
 ('KBA13_ALTERHALTER_61', -0.0012970873655181669),
 ('KBA13_KW_121', -0.0012807536021872295),
 ('KBA13_HALTER_66', -0.0012448860020480442),
 ('CJT_TYP_1', -0.001236597600803382),
 ('MIN_GEBAEUDEJAHR', -0.0012135000778569654),
 ('sum', -0.0011977640439289417),
 ('KBA13_KMH_210', -0.0011932439353844358),
 ('CJT_GESAMTTYP', -0.0011274837120294162),
 ('D19_TELKO_ANZ_24', -0.0011255345369379232),
 ('UMFELD_JUNG', -0.0011057800945243911),
 ('KBA13_KW_90', -0.0010959209808020678),
 ('KBA13_HERST_AUDI_VW', -0.001048812578287534),
 ('KBA13_KW_120', -0.0010396450634181772),
 ('KBA13_KW_61_120', -0.0010379245306660996),
 ('KBA13_KRSZUL_NEU', -0.0010283862983037212),
 ('KBA13_OPEL', -0.0010221084105876681),
 ('LNR', -0.000977173575460318),
 ('KBA13_HERST_FORD_OPEL', -0.0009423233207352155),
 ('life_stage', -0.0008417597046889721),
 ('SEMIO_REL', -0.0008366159440485427),
 ('mean', -0.0008203202833632953),
 ('KBA13_BMW', -0.0008098335455069079),
 ('SEMIO_VERT', -0.0008095872913285063),
 ('SEMIO_FAM', -0.000766301195176854),
 ('KBA13_BJ_2009', -0.0006692878124701637),
 ('KBA13_VW', -0.0006631578117751837),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN', -0.000653812861596206),
 ('KBA05_ANTG3', -0.0006503148164434296),
 ('KBA13_KMH_251', -0.0006468088794886167),
 ('HH_DELTA_FLAG', -0.0006062409734639988),
 ('D19_TELKO_ANZ_12', -0.0006058435347569617),
 ('UNGLEICHENN_FLAG', -0.0006013393822404493),
 ('KBA13_KRSSEG_OBER', -0.0005988545576800577),
 ('CAMEO_DEU_2015_2D', -0.0005485612677826006),
 ('KBA13_SITZE_5', -0.000548262547599975),
 ('SEMIO_PFLICHT', -0.0004965239058869974),
 ('CAMEO_DEU_2015_6B', -0.0004900145827222547),
 ('CAMEO_DEU_2015_9A', -0.0004416417579914669),
 ('KBA13_KRSHERST_AUDI_VW', -0.00043753757757547377),
 ('RT_KEIN_ANREIZ', -0.0004338750185504948),
 ('CJT_TYP_2', -0.00043095595741496427),
 ('CAMEO_DEU_2015_4A', -0.0004234044205844028),
 ('CAMEO_DEU_2015_4C', -0.00041792919398041364),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE', -0.00039685723578110924),
 ('SEMIO_TRADV', -0.00038896316622374677),
 ('NATIONALITAET_KZ', -0.00038631095095344295),
 ('KBA13_KRSHERST_FORD_OPEL', -0.000380959588562445),
 ('CAMEO_DEU_2015_3C', -0.0003382425647374523),
 ('KBA13_CCM_3001', -0.00033275547713786856),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE', -0.00031559190938599814),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD', -0.0003108936209022641),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT', -0.0003033064090728875),
 ('KBA13_SEG_WOHNMOBILE', -0.0002861551935630083),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST', -0.0002767465575955139),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT', -0.00025717103867535794),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL', -0.00025701715832121655),
 ('CAMEO_DEU_2015_3D', -0.00025499153997990247),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO', -0.0002498505578683105),
 ('DSL_FLAG', -0.00023470435904020294),
 ('KBA13_BJ_1999', -0.0002305302587147103),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST', -0.0002281570618234985),
 ('KBA13_VORB_2', -0.000222054786255458),
 ('CAMEO_DEU_2015_1D', -0.00021972868252578001),
 ('CAMEO_DEU_2015_8C', -0.00020800018322784987),
 ('D19_VERSI_ONLINE_QUOTE_12', -0.0002058113901034781),
 ('CAMEO_DEU_2015_5D', -0.00019627751976014534),
 ('EINGEFUEGT_AM_weekday', -0.00018843133030812764),
 ('CAMEO_DEU_2015_2C', -0.00018533502055547487),
 ('CAMEO_DEU_2015_1A', -0.0001822099982512456),
 ('Decade', -0.00017464191452838622),
 ('CAMEO_DEU_2015_7A', -0.0001682923211673114),
 ('CAMEO_DEU_2015_2B', -0.00016049939012600356),
 ('CAMEO_DEU_2015_2A', -0.0001602615978610014),
 ('KBA13_FORD', -0.0001566458259925366),
 ('CAMEO_DEU_2015_5B', -0.00014675618670143662),
 ('CAMEO_DEU_2015_1E', -0.00014155734834183208),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH', -0.00013812566765570097),
 ('CAMEO_DEU_2015_4B', -0.00013578389377276624),
 ('D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE', -0.00013166613491508337),
 ('EINGEFUEGT_AM_day', -0.00012632779256316402),
 ('CAMEO_DEU_2015_8A', -0.00012627907706660419),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE', -0.00012381144219011835),
 ('CAMEO_DEU_2015_5C', -0.0001228803161399982),
 ('KBA13_MOTOR', -0.0001171143662076556),
 ('CAMEO_DEU_2015_6E', -0.00011002351488139715),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS', -0.00010808756261433322),
 ('CAMEO_DEU_2015_3B', -0.00010740917969537917),
 ('KBA13_SEG_KOMPAKTKLASSE', -8.838730385252207e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST', -8.744401097986548e-05),
 ('CAMEO_DEU_2015_1C', -8.457487250861712e-05),
 ('CAMEO_DEU_2015_4D', -7.801228144920724e-05),
 ('CAMEO_DEU_2015_3A', -7.587367126129623e-05),
 ('KBA13_MAZDA', -7.540632986439355e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL', -7.332869269329317e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK', -7.292059492650702e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT', -7.209268342453952e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL', -7.01836191637828e-05),
 ('KBA13_CCM_1800', -6.629420879799427e-05),
 ('CAMEO_DEU_2015_4E', -6.486624413851108e-05),
 ('CAMEO_DEU_2015_5A', -5.992122085067372e-05),
 ('D19_TELKO_ONLINE_QUOTE_12', -5.9366764922968974e-05),
 ('CAMEO_DEU_2015_6A', -5.137939914119305e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER', -4.256397312211562e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST', -4.2348445031100074e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG', -3.950453707715953e-05),
 ('CAMEO_DEU_2015_1B', -3.868367140446413e-05),
 ('ANZ_TITEL', -3.289599410245105e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_REISEN', -2.763041061782146e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST', -2.6265871710888186e-05),
 ('SOHO_KZ', -2.4844869599859794e-05),
 ('CAMEO_DEU_2015_7E', -2.46432458559856e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL', -2.2924683555961626e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK', -2.216409841382518e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_GARTEN', -2.2111646894943302e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL', -2.0129323623755395e-05),
 ('CAMEO_DEU_2015_6F', -1.8313348414536926e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL', -1.72341224753538e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV', -1.6850146693040398e-05),
 ('CAMEO_DEU_2015_6D', -1.3845861896183268e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO', -1.3532622975864966e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK', -9.586148631417974e-06),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LOTTO', -8.768728674663026e-06),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG', -6.6416524705768456e-06),
 ('min', -0.0),
 ('TITEL_KZ', 2.8740837072812713e-06),
 ('KBA13_HALTER_45', 3.762268539477155e-06),
 ('KBA13_CCM_1500', 2.0496039011083174e-05),
 ('KBA13_SEG_SPORTWAGEN', 3.3051459621703394e-05),
 ('KBA13_VORB_3', 3.7161445905737836e-05),
 ('CAMEO_DEU_2015_5F', 5.953542022136672e-05),
 ('KBA13_SEG_OBERKLASSE', 6.878306220679354e-05),
 ('EINGEFUEGT_AM_month', 6.966140359418158e-05),
 ('CAMEO_DEU_2015_6C', 8.208959850000428e-05),
 ('CAMEO_DEU_2015_5E', 8.713448146434869e-05),
 ('KBA13_BJ_2000', 0.00010453977674814819),
 ('D19_TELKO_ONLINE_DATUM', 0.0002019444996741142),
 ('GFK_URLAUBERTYP', 0.0002457254137983169),
 ('KBA13_FAB_ASIEN', 0.00025321345583047576),
 ('CAMEO_DEU_2015_7B', 0.00026056253082712357),
 ('KBA13_HALTER_25', 0.00026704134981416774),
 ('KBA13_BJ_2006', 0.0002749161342683785),
 ('KBA13_SEG_UTILITIES', 0.00027573776151431967),
 ('CAMEO_DEU_2015_8D', 0.00028248790915175534),
 ('KBA13_KW_80', 0.0003172461770973533),
 ('KBA13_TOYOTA', 0.00032182036401553896),
 ('D19_VERSI_ONLINE_DATUM', 0.00033362677474890425),
 ('KBA13_KW_70', 0.00034289099606847346),
 ('KBA13_CCM_1400', 0.000367852395939519),
 ('CAMEO_DEU_2015_8B', 0.00038410321970704694),
 ('KBA13_CCM_1600', 0.0003937928133765149),
 ('KBA13_HERST_ASIEN', 0.00039611166416851954),
 ('CAMEO_DEU_2015_7C', 0.0004393531908328778),
 ('KKK', 0.0004788937614534676),
 ('KBA13_KW_60', 0.0005022335198480849),
 ('CAMEO_DEU_2015_7D', 0.0005028859586253135),
 ('FINANZ_UNAUFFAELLIGER', 0.000505812297586679),
 ('FINANZ_VORSORGER', 0.000509890088614281),
 ('CAMEO_DEU_2015_9B', 0.000518897966370756),
 ('ANREDE_KZ', 0.0005599046426017163),
 ('SHOPPER_TYP', 0.0005897034776416032),
 ('EINGEFUEGT_AM_year', 0.0006806139820833703),
 ('CAMEO_DEU_2015_9D', 0.0007028676616619181),
 ('KBA13_BJ_2004', 0.0007110632146619047),
 ('KBA13_SEG_SONSTIGE', 0.0007576661252882051),
 ('KBA13_VORB_1', 0.0007839144017434103),
 ('D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT', 0.0007920056737519607),
 ('KBA13_CCM_1200', 0.0008248932546421989),
 ('CJT_TYP_3', 0.0008349203461352624),
 ('CAMEO_DEU_2015_9C', 0.0008517120808979838),
 ('CJT_TYP_5', 0.0008645188382901927),
 ('KBA13_NISSAN', 0.0009288085672311811),
 ('D19_BANKEN_OFFLINE_DATUM', 0.0009691840396270614),
 ('SEMIO_KAEM', 0.0010167283860491063),
 ('KBA13_FIAT', 0.001028100088357275),
 ('LP_LEBENSPHASE_FEIN_age', 0.0010530819745549617),
 ('D19_VERSI_OFFLINE_DATUM', 0.0010662258209162244),
 ('KBA13_KMH_140', 0.0010998470715246182),
 ('KBA13_SITZE_4', 0.0011083067290137943),
 ('ANZ_HH_TITEL', 0.0011671450160742284),
 ('CJT_TYP_6', 0.001201265969511162),
 ('OST_WEST_KZ', 0.0012285448683742684),
 ('KBA13_PEUGEOT', 0.001257802318071084),
 ('KBA13_KMH_140_210', 0.0012751702134970416),
 ('EINGEZOGENAM_HH_JAHR', 0.0012783383889174557),
 ('ALTERSKATEGORIE_GROB', 0.0013377911933094447),
 ('KBA13_KMH_180', 0.0013607475380031189),
 ('KBA13_RENAULT', 0.0014258483301908553),
 ('KBA13_KW_0_60', 0.0014655767892592265),
 ('CAMEO_DEU_2015_9E', 0.0014712403743071716),
 ('KBA13_KRSSEG_KLEIN', 0.001526363039048667),
 ('STRUKTURTYP', 0.0016098878938623042),
 ('KBA13_KMH_110', 0.001640671463341816),
 ('KBA13_VORB_1_2', 0.0017444267603426386),
 ('KBA13_HALTER_40', 0.0017629691316897614),
 ('KBA13_KW_40', 0.0018000082253257432),
 ('EWDICHTE', 0.001835460776034751),
 ('KBA13_CCM_1000', 0.0018362205467111331),
 ('KBA13_HERST_EUROPA', 0.0018524580840241069),
 ('KBA13_KW_30', 0.0018662698759185795),
 ('SEMIO_ERL', 0.0018905651369712216),
 ('VERS_TYP', 0.0018933549526084417),
 ('KOMBIALTER', 0.0018949305870127264),
 ('RELAT_AB', 0.001913109435299289),
 ('KBA13_ALTERHALTER_30', 0.0019205052744249254),
 ('KBA13_ALTERHALTER_45', 0.0019300745276492244),
 ('max', 0.0019654720168127537),
 ('KBA13_ANTG2', 0.00198330286235276),
 ('PLZ8_HHZ', 0.0020121292403325057),
 ('REGIOTYP', 0.0020192662633418982),
 ('KBA13_CCM_0_1400', 0.0020516712026530082),
 ('KBA13_HHZ', 0.002058008948978093),
 ('KBA13_FAB_SONSTIGE', 0.00215699501001978),
 ('KBA13_HERST_SONST', 0.00215699501001978),
 ('RETOURTYP_BK_S', 0.0021894150194284945),
 ('KBA13_KW_50', 0.0021983808717052334),
 ('CJT_TYP_4', 0.002274458491614601),
 ('D19_TELKO_OFFLINE_DATUM', 0.002334983322784511),
 ('KONSUMZELLE', 0.0023437350803427997),
 ('D19_KONSUMTYP', 0.002358534500873707),
 ('med', 0.0024899786966556935),
 ('PLZ8_ANTG2', 0.0024995602988201475),
 ('KBA13_SEG_KLEINWAGEN', 0.0025122413925594868),
 ('SEMIO_LUST', 0.0025423391103558323),
 ('KBA13_HALTER_30', 0.0027739799035492604),
 ('VERDICHTUNGSRAUM', 0.0027811147183977028),
 ('KBA13_HALTER_35', 0.0028487301912072257),
 ('KBA13_KMH_0_140', 0.0028590643236981387),
 ('SEMIO_KRIT', 0.0029098665372814803),
 ('WOHNLAGE_adj', 0.002912618270004129),
 ('GEBAEUDETYP', 0.003028681289110591),
 ('KBA13_SEG_KLEINST', 0.0032665593678878105),
 ('kurt', 0.0032818168510714376),
 ('skew', 0.003351679242775527),
 ('AKT_DAT_KL', 0.0033976189769235584),
 ('ORTSGR_KLS9', 0.0034426937584140374),
 ('CJT_KATALOGNUTZER', 0.003776981884918745),
 ('W_KEIT_KIND_HH', 0.003874892673988767),
 ('ARBEIT', 0.0038813118694626704),
 ('KBA13_SEG_MINIWAGEN', 0.0039533370398333635),
 ('VHN', 0.003984986376063327),
 ('D19_VERSAND_ONLINE_DATUM', 0.004265147497810316),
 ('VK_ZG11', 0.004543502544706553),
 ('KK_KUNDENTYP_mod', 0.004623000993936986),
 ('CAMEO_DEUG_2015_adj', 0.004682722286933696),
 ('D19_VERSAND_DATUM', 0.004691278988939944),
 ('D19_GESAMT_ONLINE_DATUM', 0.0047292479979844934),
 ('D19_KONSUMTYP_MAX', 0.004770838204533751),
 ('D19_GESAMT_DATUM', 0.004771384200535808),
 ('KBA13_ANTG4', 0.004915266375440449),
 ('HH_EINKOMMEN_SCORE', 0.004998143304889855),
 ('KBA13_BAUMAX', 0.005019259597646622),
 ('PLZ8_ANTG4', 0.005121152358034188),
 ('KBA13_ANTG3', 0.005131391459666297),
 ('PLZ8_ANTG3', 0.0051636484354620595),
 ('RT_SCHNAEPPCHEN', 0.0052458450703514155),
 ('VK_DISTANZ', 0.0059348916308099905),
 ('PLZ8_BAUMAX_adj', 0.0060238761299379285),
 ('D19_TELKO_DATUM', 0.0060377632266837376),
 ('ZABEOTYP', 0.006160823276804196),
 ('VK_DHT4A', 0.006262439864396575),
 ('D19_BANKEN_ONLINE_DATUM', 0.0065634798261977505),
 ('FINANZ_HAUSBAUER', 0.006574490369278859),
 ('D19_VERSAND_OFFLINE_DATUM', 0.008211332058724439),
 ('D19_BANKEN_DATUM', 0.008402856923206912),
 ('D19_VERSI_DATUM', 0.010145051876824346),
 ('KBA05_BAUMAX', 0.010166007696781906),
 ('D19_GESAMT_OFFLINE_DATUM', 0.011076529221950987),
 ('KBA05_ANTG4', 0.012246945110118188),
 ('KBA05_MODTEMP', 0.01733430871254857),
 ('KBA05_MAXAH', 0.021627444108880935),
 ('ANZ_HAUSHALTE_AKTIV', 0.025641159689443496),
 ('KBA05_KW2', 0.026157175875229047),
 ('KBA05_MAXBJ', 0.02826668309209693),
 ('KBA05_CCM3', 0.028983155778296046),
 ('KBA05_CCM2', 0.02931199132329707),
 ('KBA05_SEG4', 0.029709415206847986),
 ('KBA05_SEG3', 0.029714131982758022),
 ('KBA05_MOD2', 0.030313450626217972),
 ('KBA05_MAXSEG', 0.030388163479125242),
 ('ANZ_STATISTISCHE_HAUSHALTE', 0.0306595882534312),
 ('KBA05_KRSHERST1', 0.031274216795021584),
 ('KBA05_ZUL1', 0.03203426849805488),
 ('KBA05_MOD3', 0.03258774068384572),
 ('KBA05_KRSHERST2', 0.032622641170282754),
 ('KBA05_HERST2', 0.03279647616627571),
 ('KBA05_AUTOQUOT', 0.03472386896901145),
 ('KBA05_ALTER4', 0.03502106055452619),
 ('KBA05_VORB2', 0.03525842301616768),
 ('KBA05_ALTER2', 0.03532997499596683),
 ('KBA05_KRSAQUOT', 0.0355399747041501),
 ('KBA05_VORB1', 0.03559653270925678),
 ('KBA05_ALTER3', 0.03592512323536563),
 ('KBA05_KRSHERST3', 0.03673799037957795),
 ('KBA05_HERST5', 0.0369814066859424),
 ('KBA05_ZUL2', 0.037068445627124275),
 ('KBA05_HERST3', 0.03734619733456471),
 ('KBA05_VORB0', 0.03785536767969694),
 ('KBA05_DIESEL', 0.03846208660536566),
 ('KBA05_SEG10', 0.03894194999430343),
 ('KBA05_ZUL3', 0.03988664322862764),
 ('KBA05_FRAU', 0.04033504568574435),
 ('KBA05_KW3', 0.04144417955800871),
 ('KBA05_MOD1', 0.04230621703511017),
 ('KBA05_CCM4', 0.04258164454779672),
 ('KBA05_ZUL4', 0.043086947836299655),
 ('KBA05_ALTER1', 0.043241294453061505),
 ('KBA05_HERST4', 0.04331188669375167),
 ('KBA05_MOD8', 0.04457940010142115),
 ('KBA05_SEG9', 0.045430849240256026),
 ('KBA05_SEG7', 0.046119543037036956),
 ('KBA05_KW1', 0.04658115995045979),
 ('KBA05_CCM1', 0.04968166606935027),
 ('KBA05_SEG1', 0.054561839920494334),
 ('KBA05_SEG2', 0.055995616923909905),
 ('KBA05_MOD4', 0.05656244817220451),
 ('KBA05_MOTOR', 0.06369088250306176),
 ('KBA05_HERSTTEMP', 0.06747988278591896),
 ('KBA05_HERST1', 0.06802352967833578),
 ('KBA05_KRSOBER', 0.0794027812410365),
 ('KBA05_ANHANG', 0.07949888268896467),
 ('KBA05_MAXVORB', 0.07995349840737123),
 ('KBA05_KRSZUL', 0.08106336841893383),
 ('KBA05_KRSVAN', 0.08111636742160336),
 ('KBA05_SEG5', 0.08190370068189194),
 ('KBA05_MAXHERST', 0.08224640093164974),
 ('KBA05_MOTRAD', 0.08460494734215344),
 ('KBA05_SEG8', 0.09528893336795631),
 ('KBA05_SEG6', 0.10401259923121131),
 ('KBA05_KRSKLEIN', 0.9175621828832754)]

status from car ownership

In [336]:
getWeight(pca, 2, gen_pop_low_nan_encoded, False)
Out[336]:
[('D19_GESAMT_OFFLINE_DATUM', -0.15800334965636664),
 ('D19_BANKEN_DATUM', -0.14795788607080707),
 ('D19_VERSI_DATUM', -0.13622028210365306),
 ('D19_VERSAND_OFFLINE_DATUM', -0.1244457299988457),
 ('D19_BANKEN_ONLINE_DATUM', -0.12112178222343817),
 ('PLZ8_BAUMAX_adj', -0.09248974768460314),
 ('D19_TELKO_DATUM', -0.08704368001866987),
 ('ANZ_STATISTISCHE_HAUSHALTE', -0.08537222643825047),
 ('ANZ_HAUSHALTE_AKTIV', -0.08444843861124689),
 ('ZABEOTYP', -0.0779019696380158),
 ('KBA13_ANTG3', -0.07394443703529278),
 ('PLZ8_ANTG3', -0.06938942167422846),
 ('VK_DISTANZ', -0.06658224791702685),
 ('D19_GESAMT_ONLINE_DATUM', -0.06650003221369383),
 ('D19_VERSAND_DATUM', -0.0656242087162267),
 ('VK_DHT4A', -0.06229977884413125),
 ('KK_KUNDENTYP_mod', -0.0620479981758014),
 ('D19_VERSAND_ONLINE_DATUM', -0.06145013812426142),
 ('D19_GESAMT_DATUM', -0.060313022113151045),
 ('VK_ZG11', -0.05573717301218422),
 ('FINANZ_HAUSBAUER', -0.05545501098111772),
 ('D19_KONSUMTYP_MAX', -0.05436051491805552),
 ('KBA13_BAUMAX', -0.05067994540762328),
 ('KBA13_ANTG4', -0.05039984142108916),
 ('PLZ8_ANTG2', -0.048912631974295066),
 ('PLZ8_ANTG4', -0.047815802635687445),
 ('ORTSGR_KLS9', -0.04773885270587115),
 ('ARBEIT', -0.043415298552115615),
 ('HH_EINKOMMEN_SCORE', -0.0432647631923361),
 ('KBA13_ANTG2', -0.04247839458794334),
 ('D19_KONSUMTYP', -0.04094872945758861),
 ('KBA05_ANTG3', -0.039956016491344705),
 ('skew', -0.03864313994662544),
 ('W_KEIT_KIND_HH', -0.037689007504248556),
 ('kurt', -0.03724297846131674),
 ('GEBAEUDETYP', -0.03612414249905479),
 ('CAMEO_DEUG_2015_adj', -0.03612140896231528),
 ('D19_TELKO_OFFLINE_DATUM', -0.03609883280291875),
 ('STRUKTURTYP', -0.035438640465873465),
 ('KBA05_BAUMAX', -0.03338506746252722),
 ('VERDICHTUNGSRAUM', -0.03280361316837514),
 ('VHN', -0.03038938301555288),
 ('RELAT_AB', -0.03036954971260597),
 ('WOHNLAGE_adj', -0.02820910468722621),
 ('EWDICHTE', -0.027642142775135214),
 ('KBA05_ANTG4', -0.02704211602316181),
 ('AKT_DAT_KL', -0.026728265165954022),
 ('KKK', -0.025854204033841692),
 ('CJT_KATALOGNUTZER', -0.02517189760589186),
 ('REGIOTYP', -0.02075619334599136),
 ('D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT', -0.020205336123225393),
 ('KBA13_HALTER_35', -0.02001791661152729),
 ('D19_BANKEN_OFFLINE_DATUM', -0.019600399080074285),
 ('KBA13_HALTER_30', -0.019501997248129302),
 ('KONSUMZELLE', -0.018408454298872296),
 ('KBA13_HHZ', -0.017031018397502307),
 ('D19_VERSI_OFFLINE_DATUM', -0.015469272015389947),
 ('PLZ8_HHZ', -0.014863524002615846),
 ('KBA13_VORB_3', -0.01485569988042546),
 ('KBA13_SITZE_4', -0.014687317375608562),
 ('RETOURTYP_BK_S', -0.014346921814308536),
 ('KBA13_SEG_OBEREMITTELKLASSE', -0.014344328786168524),
 ('KBA13_SEG_MINIWAGEN', -0.013508967523498443),
 ('KBA13_KW_50', -0.013022623032500992),
 ('KBA13_BJ_2000', -0.012890957740458665),
 ('KBA05_ZUL1', -0.01226499687778283),
 ('KOMBIALTER', -0.011943840604083052),
 ('KBA05_VORB2', -0.011797185808313291),
 ('KBA13_FIAT', -0.011743360392074898),
 ('KBA13_ALTERHALTER_30', -0.011209017193377416),
 ('KBA13_CCM_0_1400', -0.011061549731455516),
 ('CJT_TYP_4', -0.010284421349493809),
 ('KBA13_HALTER_65', -0.009632637622337455),
 ('KBA13_HALTER_66', -0.009463479658600431),
 ('KBA13_HALTER_40', -0.009339342595063155),
 ('KBA13_HERST_BMW_BENZ', -0.009246622227289052),
 ('KBA13_ALTERHALTER_61', -0.00918721157243873),
 ('KBA13_ALTERHALTER_45', -0.008736124554521597),
 ('KBA13_KMH_0_140', -0.008420763289198505),
 ('KBA13_SEG_SPORTWAGEN', -0.008282124295081255),
 ('KBA13_BMW', -0.008260808275218839),
 ('KBA05_MAXVORB', -0.008251245115939929),
 ('KBA13_HERST_EUROPA', -0.007823340009327386),
 ('KBA13_NISSAN', -0.007580427454455843),
 ('KBA13_KW_0_60', -0.007560776862643698),
 ('VERS_TYP', -0.007499360120491402),
 ('KBA13_BJ_1999', -0.007368139756106727),
 ('KBA13_SEG_OBERKLASSE', -0.007337717324079522),
 ('KBA05_ALTER1', -0.007337697978604512),
 ('KBA13_MERCEDES', -0.007264982743389969),
 ('KBA13_KW_40', -0.007215751144818698),
 ('KBA13_CCM_1000', -0.007214069219011918),
 ('KBA13_SEG_SONSTIGE', -0.007044559883290131),
 ('KBA13_MAZDA', -0.007034613431849636),
 ('SEMIO_REL', -0.006894086680776855),
 ('SEMIO_KAEM', -0.0066909222750318525),
 ('SEMIO_KRIT', -0.006636192932034425),
 ('KBA13_FAB_SONSTIGE', -0.006503887812101961),
 ('KBA13_HERST_SONST', -0.006503887812101961),
 ('RT_KEIN_ANREIZ', -0.0064584327396288894),
 ('CAMEO_DEU_2015_8A', -0.0063574465419070645),
 ('KBA13_RENAULT', -0.006331351171450835),
 ('GFK_URLAUBERTYP', -0.006284489224208324),
 ('DSL_FLAG', -0.006012411291564607),
 ('KBA13_SEG_KOMPAKTKLASSE', -0.005793161396036228),
 ('CJT_TYP_2', -0.005619795045725877),
 ('KBA05_ANTG2', -0.005462232565055574),
 ('KBA05_MAXHERST', -0.005369206981075191),
 ('KBA13_VORB_2', -0.005093954547560202),
 ('SEMIO_LUST', -0.004844228396192548),
 ('KBA13_SEG_MITTELKLASSE', -0.004668728931640602),
 ('KBA13_HERST_ASIEN', -0.0044633499545405665),
 ('SEMIO_SOZ', -0.004420112863456329),
 ('SEMIO_FAM', -0.004384626317832229),
 ('LP_LEBENSPHASE_FEIN_age', -0.004293836961193312),
 ('ANZ_HH_TITEL', -0.004292150201946353),
 ('KBA13_CCM_1200', -0.004230992670519799),
 ('SEMIO_DOM', -0.004229989383366263),
 ('CAMEO_DEU_2015_9D', -0.004198554123245625),
 ('KBA13_KMH_180', -0.0039936692871996495),
 ('CAMEO_DEU_2015_9C', -0.003962960232446439),
 ('D19_VERSI_ONLINE_DATUM', -0.0039358375351788615),
 ('KBA13_PEUGEOT', -0.0038321248747918125),
 ('CAMEO_DEU_2015_9B', -0.0037593415085494047),
 ('SEMIO_PFLICHT', -0.003651087625858664),
 ('KBA05_KW1', -0.0036475522032752856),
 ('FINANZ_SPARER', -0.0036350807011961317),
 ('KBA13_KW_60', -0.0035605099233174966),
 ('KBA13_TOYOTA', -0.0035165664736140586),
 ('KBA13_SEG_KLEINST', -0.0034005954656676353),
 ('HEALTH_TYP', -0.003252932297703467),
 ('KBA13_CCM_1400', -0.0032415456323051),
 ('CAMEO_DEU_2015_8B', -0.0032406286717622653),
 ('SEMIO_KULT', -0.002869765562237713),
 ('D19_TELKO_ONLINE_DATUM', -0.0028221021092531603),
 ('KBA13_KRSHERST_FORD_OPEL', -0.002617803142413352),
 ('CAMEO_DEU_2015_8C', -0.0025038207500714866),
 ('KBA05_CCM2', -0.0023020098587032095),
 ('OST_WEST_KZ', -0.002131822481552167),
 ('KBA05_SEG3', -0.0019999540623105543),
 ('KBA05_HERST5', -0.0019693788985475844),
 ('CAMEO_DEU_2015_9A', -0.00196123616084573),
 ('ALTERSKATEGORIE_GROB', -0.0019527451751439006),
 ('CAMEO_DEU_2015_8D', -0.0018230901184271141),
 ('CJT_TYP_6', -0.0017375864691707764),
 ('CAMEO_DEU_2015_7B', -0.001691808803653688),
 ('KBA05_MAXSEG', -0.0016677487538800384),
 ('SEMIO_MAT', -0.0016294118146642186),
 ('KBA13_KRSSEG_KLEIN', -0.0014772816606840274),
 ('KBA05_KRSHERST3', -0.0013838953292819288),
 ('RT_SCHNAEPPCHEN', -0.0013540578759327852),
 ('NATIONALITAET_KZ', -0.0012712407530600111),
 ('EINGEZOGENAM_HH_JAHR', -0.0012217643318593013),
 ('FINANZ_VORSORGER', -0.0011182553086236625),
 ('KBA05_CCM1', -0.0009281306875584014),
 ('CAMEO_DEU_2015_6E', -0.0008310456282622853),
 ('SEMIO_ERL', -0.0008212998180921496),
 ('CAMEO_DEU_2015_7C', -0.0008033300212443738),
 ('KBA13_HALTER_25', -0.0007488876896737164),
 ('CAMEO_DEU_2015_5D', -0.0007246867239078048),
 ('CAMEO_DEU_2015_9E', -0.0006988455506135784),
 ('KBA13_FAB_ASIEN', -0.0006971657898230164),
 ('CAMEO_DEU_2015_7A', -0.0004786401276694982),
 ('KBA13_CCM_3000', -0.0004669481505178134),
 ('CAMEO_DEU_2015_7D', -0.00043248468134917626),
 ('WOHNDAUER_2008', -0.0003893259263931411),
 ('CAMEO_DEU_2015_6C', -0.0003306788086113012),
 ('ANREDE_KZ', -0.000274087004171211),
 ('CAMEO_DEU_2015_6A', -0.00025288728292565055),
 ('CAMEO_DEU_2015_6F', -0.00022856938494785582),
 ('CAMEO_DEU_2015_6D', -0.00019381426140822105),
 ('KBA13_KMH_140', -0.0001845187402048477),
 ('CAMEO_DEU_2015_7E', -0.00017813953124004548),
 ('KBA13_CCM_1600', -6.890865650705166e-05),
 ('TITEL_KZ', -5.1163165223606563e-05),
 ('CAMEO_DEU_2015_5E', -3.777003082355687e-05),
 ('min', -0.0),
 ('KBA13_SEG_KLEINWAGEN', 3.5493936979183085e-05),
 ('CAMEO_DEU_2015_6B', 4.755394552635173e-05),
 ('CAMEO_DEU_2015_5F', 6.799733104723203e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LOTTO', 7.20051764948677e-05),
 ('ANZ_TITEL', 8.191695272752283e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK', 9.018699251618582e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO', 0.0001243864966744559),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG', 0.00013594759805116223),
 ('SOHO_KZ', 0.0001373665368946826),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK', 0.00017565040992789142),
 ('D19_LETZTER_KAUF_BRANCHE_D19_GARTEN', 0.000177977716653619),
 ('D19_LETZTER_KAUF_BRANCHE_D19_REISEN', 0.00017922135546243382),
 ('SHOPPER_TYP', 0.00018485702980142776),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL', 0.0001859321305010357),
 ('KBA13_FORD', 0.00018876200416511314),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV', 0.00019967657644591347),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL', 0.00020757414377947883),
 ('CAMEO_DEU_2015_4E', 0.0002162883204653154),
 ('D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST', 0.00024967591157210695),
 ('CAMEO_DEU_2015_1B', 0.0002512042322725445),
 ('D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER', 0.00028956858906476336),
 ('CAMEO_DEU_2015_1C', 0.00029015364255221144),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST', 0.00032053291098888556),
 ('KBA13_KW_70', 0.00033059129213173236),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL', 0.0003595961823537285),
 ('CAMEO_DEU_2015_1E', 0.00038299465161545994),
 ('D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG', 0.00038451171914941155),
 ('LNR', 0.0003932088321489716),
 ('CAMEO_DEU_2015_4D', 0.00041844772376148473),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK', 0.0006288734697423051),
 ('CAMEO_DEU_2015_1A', 0.0006440212079424709),
 ('KBA05_ALTER2', 0.0006912748453209769),
 ('D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE', 0.0007237427377317887),
 ('CAMEO_DEU_2015_5B', 0.0007261841487843647),
 ('CAMEO_DEU_2015_5C', 0.0007653820556354464),
 ('CAMEO_DEU_2015_5A', 0.0007689719679032317),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST', 0.0007893296312138982),
 ('KBA13_SEG_WOHNMOBILE', 0.0008028790673513987),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE', 0.0008189963173483344),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL', 0.0008416510111443617),
 ('D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT', 0.0009039641653858466),
 ('CAMEO_DEU_2015_4B', 0.0009057396845165934),
 ('KBA05_HERST3', 0.001016827872773973),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL', 0.0010577640379915744),
 ('D19_TELKO_ONLINE_QUOTE_12', 0.0010678070357282986),
 ('KBA05_HERST4', 0.001109498014225459),
 ('KBA05_MOD3', 0.0011121802357588575),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS', 0.0011832937719053916),
 ('CJT_TYP_1', 0.0012957729227245144),
 ('KBA13_KW_121', 0.0013460258695007486),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH', 0.0013662713454045),
 ('CAMEO_DEU_2015_1D', 0.00140301169838549),
 ('KBA13_KW_30', 0.0016905985484025263),
 ('CAMEO_DEU_2015_3B', 0.0017354089394488775),
 ('KBA05_FRAU', 0.0018841186653929076),
 ('FINANZTYP', 0.001976466060044143),
 ('CAMEO_DEU_2015_3C', 0.002026852552571381),
 ('KBA13_CCM_2501', 0.00203522979201137),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE', 0.0020611926270793306),
 ('CAMEO_DEU_2015_2B', 0.0021753382506451346),
 ('KBA05_SEG4', 0.0021853527802581695),
 ('UNGLEICHENN_FLAG', 0.0022042782616964094),
 ('CAMEO_DEU_2015_3D', 0.0022694307268042685),
 ('KBA13_MOTOR', 0.0024245420082012266),
 ('D19_VERSI_ONLINE_QUOTE_12', 0.0025469483626670807),
 ('SEMIO_RAT', 0.0025575939672058704),
 ('CAMEO_DEU_2015_2A', 0.0027222929594033266),
 ('KBA13_KMH_110', 0.0027376131001066164),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST', 0.0027739482269390882),
 ('KBA05_MOD2', 0.002803113013329107),
 ('CAMEO_DEU_2015_2C', 0.002882729162318732),
 ('CAMEO_DEU_2015_2D', 0.0029033214502718783),
 ('KBA13_HERST_FORD_OPEL', 0.0029540252395517616),
 ('CAMEO_DEU_2015_3A', 0.003053535441977346),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO', 0.003149114741716443),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST', 0.0032109497220280674),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT', 0.0032819242872665624),
 ('KBA13_KMH_251', 0.003353812569830374),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL', 0.0033665905407238967),
 ('HH_DELTA_FLAG', 0.003480584454425406),
 ('KBA05_ALTER4', 0.0034927824514177524),
 ('KBA13_OPEL', 0.0035197407472685777),
 ('KBA13_KW_120', 0.0035546084888654654),
 ('KBA13_KMH_211', 0.003562215455891129),
 ('KBA05_SEG2', 0.0038299635150172958),
 ('KBA13_KMH_250', 0.003909867122257505),
 ('CAMEO_DEU_2015_4A', 0.00396325014024416),
 ('CAMEO_DEU_2015_4C', 0.003984779034044134),
 ('KBA05_SEG6', 0.004012982562767678),
 ('KBA05_KRSHERST2', 0.004075843872438296),
 ('KBA13_KRSHERST_BMW_BENZ', 0.004131711840551575),
 ('KBA05_KW2', 0.0043214238751888455),
 ('SEMIO_TRADV', 0.004337178020357382),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT', 0.004430692345991116),
 ('max', 0.004444237779423291),
 ('KBA13_CCM_1800', 0.00448346333537826),
 ('KBA05_MAXAH', 0.004498253445987891),
 ('KBA13_CCM_3001', 0.004525324398923644),
 ('KBA05_CCM4', 0.0045389853164424475),
 ('KBA05_VORB1', 0.0045705970257315254),
 ('KBA13_KW_80', 0.0045891086950064245),
 ('CJT_TYP_3', 0.004639030505561214),
 ('KBA13_CCM_2500', 0.004657111157767983),
 ('KBA13_KW_90', 0.004700828867648912),
 ('KBA05_SEG1', 0.0047164602114939565),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD', 0.004868237024164378),
 ('KBA13_KMH_140_210', 0.004953965842656034),
 ('KBA13_VORB_1_2', 0.0049997084385715405),
 ('KBA13_SITZE_5', 0.005068401561476299),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE', 0.005113328599410313),
 ('KBA05_KRSHERST1', 0.005312777722313259),
 ('KBA05_HERST1', 0.0054754675298316815),
 ('KBA05_KW3', 0.005629975499193037),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN', 0.0056752964280010875),
 ('KBA13_CCM_1500', 0.005734592201389293),
 ('KBA13_BJ_2004', 0.006044077856836811),
 ('KBA05_ZUL2', 0.006116206559969776),
 ('KBA05_MOD1', 0.0061655281972260246),
 ('KBA05_CCM3', 0.006277050950338969),
 ('KBA13_VORB_1', 0.0064379568152977225),
 ('KBA05_HERST2', 0.0064669627106494635),
 ('KBA13_KMH_210', 0.006558878661742031),
 ('KBA05_MAXBJ', 0.006652949292582125),
 ('FINANZ_UNAUFFAELLIGER', 0.006695045336975164),
 ('FINANZ_ANLEGER', 0.006900576104034378),
 ('KBA13_KRSHERST_AUDI_VW', 0.006924989058739701),
 ('KBA13_SEG_UTILITIES', 0.007054831718384752),
 ('med', 0.007104099903671668),
 ('CJT_TYP_5', 0.007448869674136662),
 ('KBA05_KRSVAN', 0.007641187591718229),
 ('KBA13_HALTER_45', 0.007874368075632084),
 ('Decade', 0.008301311129617451),
 ('KBA13_AUDI', 0.008394758584834749),
 ('KBA05_MOD4', 0.008462660114670426),
 ('CJT_GESAMTTYP', 0.008520380341550862),
 ('KBA05_KRSOBER', 0.00852452145178235),
 ('KBA13_KW_61_120', 0.008587907727467488),
 ('SEMIO_VERT', 0.008762851722491844),
 ('KBA05_VORB0', 0.008847305201884889),
 ('KBA05_SEG5', 0.009129889120823478),
 ('D19_TELKO_ANZ_12', 0.009462960039923893),
 ('KBA13_BJ_2006', 0.009822074682661509),
 ('KBA05_MOTOR', 0.009840324001906557),
 ('KBA05_ALTER3', 0.010048244894796537),
 ('UMFELD_ALT', 0.010062085860262342),
 ('KBA05_SEG7', 0.01015882731803327),
 ('KBA05_SEG8', 0.010208142076244136),
 ('KBA13_BJ_2009', 0.010801412073219848),
 ('KBA13_KRSSEG_OBER', 0.011287035713115305),
 ('KBA13_VW', 0.011484129562177352),
 ('KBA05_SEG9', 0.011486649056265437),
 ('KBA13_SITZE_6', 0.011775001080292303),
 ('KBA05_KRSZUL', 0.012000150322975372),
 ('KBA05_DIESEL', 0.012224362163859396),
 ('KBA13_KRSSEG_VAN', 0.012270300328867625),
 ('KBA13_HERST_AUDI_VW', 0.012801541478741902),
 ('KBA13_KRSZUL_NEU', 0.012895377975017733),
 ('KBA05_MOD8', 0.01306636165394911),
 ('KBA05_ZUL4', 0.01315645296162787),
 ('KBA13_SEG_GROSSRAUMVANS', 0.013244302765201789),
 ('KBA13_SEG_MINIVANS', 0.013271700325980938),
 ('KBA13_VORB_0', 0.013415183396212164),
 ('KBA05_ZUL3', 0.013467118820567461),
 ('KBA05_SEG10', 0.013937658855817435),
 ('KBA13_CCM_2000', 0.01431782957725317),
 ('KBA13_SEG_VAN', 0.014587109538149078),
 ('KBA13_KW_110', 0.015029841861584119),
 ('life_stage', 0.01652753442160328),
 ('GREEN_AVANTGARDE', 0.01704568305650539),
 ('D19_TELKO_ANZ_24', 0.017081129217637464),
 ('KBA13_CCM_1401_2500', 0.01743395555009569),
 ('UMFELD_JUNG', 0.018404879296724474),
 ('BALLRAUM', 0.019598429894273704),
 ('KBA13_HALTER_60', 0.020186083847310844),
 ('EINGEFUEGT_AM_weekday', 0.02142447552971795),
 ('RT_UEBERGROESSE', 0.021526200779116902),
 ('KBA13_ANZAHL_PKW', 0.021540222422641926),
 ('KBA13_BJ_2008', 0.02177159431932917),
 ('D19_VERSI_ANZ_12', 0.021834895893219353),
 ('ANZ_KINDER', 0.022288371138767783),
 ('KBA13_KRSAQUOT', 0.022742513777217387),
 ('D19_BANKEN_LOKAL', 0.024181695919216317),
 ('KBA05_MODTEMP', 0.024456176129776994),
 ('ALTERSKATEGORIE_FEIN', 0.02572192949653117),
 ('GEBURTSJAHR', 0.026306842558743194),
 ('D19_SONSTIGE', 0.027044785610564608),
 ('KBA05_KRSAQUOT', 0.027399855827399837),
 ('D19_TECHNIK', 0.029806543304848214),
 ('KBA05_ANTG1', 0.030279228980266016),
 ('KBA13_HALTER_20', 0.030580885936955418),
 ('KBA13_ANTG1', 0.03152984558453373),
 ('LP_LEBENSPHASE_GROB_stat', 0.031957703543676046),
 ('D19_BANKEN_ANZ_12', 0.03261611409281244),
 ('sum', 0.03263299959344532),
 ('D19_GESAMT_ONLINE_QUOTE_12', 0.03271665496784145),
 ('INNENSTADT', 0.03301376845439323),
 ('KBA05_MOTRAD', 0.03364955147515574),
 ('GEBAEUDETYP_RASTER', 0.034229686997011906),
 ('LP_LEBENSPHASE_GROB_wealth', 0.03481327248447219),
 ('LP_STATUS_FEIN', 0.03514370581453605),
 ('KBA13_SEG_GELAENDEWAGEN', 0.035158601049704315),
 ('wealth', 0.03526528927569203),
 ('D19_VOLLSORTIMENT', 0.03544186533201865),
 ('D19_BUCH_CD', 0.03600051059250311),
 ('LP_LEBENSPHASE_FEIN_stat', 0.03606254998905094),
 ('D19_DIGIT_SERV', 0.03722535980626322),
 ('D19_VERSI_ANZ_24', 0.03729658337238533),
 ('std', 0.03853385702730732),
 ('LP_STATUS_GROB', 0.038653065068396754),
 ('FINANZ_MINIMALIST', 0.03880690506332323),
 ('KBA05_AUTOQUOT', 0.03933045391958115),
 ('KBA05_KRSKLEIN', 0.03951538614579532),
 ('GEMEINDETYP', 0.04031186249454604),
 ('KBA05_GBZ', 0.04255693862708163),
 ('mean', 0.042666949178675015),
 ('LP_FAMILIE_FEIN', 0.04323022184861891),
 ('LP_LEBENSPHASE_FEIN_wealth', 0.04341867062194155),
 ('KBA05_ANHANG', 0.04363774834854772),
 ('LP_FAMILIE_GROB', 0.0452252251070444),
 ('KONSUMNAEHE', 0.045324019044789526),
 ('ALTER_HH', 0.0463518810280377),
 ('D19_LOTTO', 0.04646835414525652),
 ('D19_ENERGIE', 0.04696833458389123),
 ('KBA05_HERSTTEMP', 0.04744524061803985),
 ('D19_SOZIALES', 0.04808375862037698),
 ('D19_VERSICHERUNGEN', 0.05071174429992027),
 ('D19_BANKEN_ANZ_24', 0.05152251919284544),
 ('MOBI_REGIO', 0.05192675671420615),
 ('D19_NAHRUNGSERGAENZUNG', 0.05298052233019598),
 ('D19_TIERARTIKEL', 0.05418065670301745),
 ('D19_VERSAND_ONLINE_QUOTE_12', 0.0544699678139363),
 ('ONLINE_AFFINITAET', 0.05450939639899599),
 ('PLZ8_GBZ', 0.05452127761169879),
 ('FIRMENDICHTE', 0.05467300225052271),
 ('KBA13_HALTER_55', 0.05472728663310124),
 ('KBA13_HALTER_50', 0.05484545020012628),
 ('VHA', 0.0561519933081619),
 ('KBA13_GBZ', 0.05757328116360319),
 ('KBA13_ALTERHALTER_60', 0.05871394023178372),
 ('D19_GARTEN', 0.06033387308908219),
 ('MOBI_RASTER', 0.06165517402166685),
 ('D19_BIO_OEKO', 0.06357673030125377),
 ('PLZ8_ANTG1', 0.0655676795152362),
 ('D19_LEBENSMITTEL', 0.06810765976738557),
 ('KBA13_AUTOQUOTE', 0.07143693115392924),
 ('Movement', 0.07181292249829302),
 ('D19_WEIN_FEINKOST', 0.07368289542555144),
 ('D19_GESAMT_ANZ_12', 0.07435029414202385),
 ('D19_VERSAND_ANZ_24', 0.08412227176420277),
 ('D19_BANKEN_GROSS', 0.08523875858287425),
 ('D19_BANKEN_REST', 0.08698218096412637),
 ('D19_BEKLEIDUNG_GEH', 0.09175758319978704),
 ('ANZ_PERSONEN', 0.09300803421102447),
 ('EINGEFUEGT_AM_day', 0.0931730146021071),
 ('D19_BILDUNG', 0.09369124816722794),
 ('D19_SAMMELARTIKEL', 0.09509200266486431),
 ('D19_SCHUHE', 0.0965101765239172),
 ('D19_GESAMT_ANZ_24', 0.10296639005586601),
 ('D19_RATGEBER', 0.10688774544608155),
 ('D19_VERSAND_ANZ_12', 0.11526783934765587),
 ('D19_TELKO_REST', 0.1196674957692957),
 ('D19_FREIZEIT', 0.12732846562952493),
 ('D19_DROGERIEARTIKEL', 0.13109663281962203),
 ('D19_HANDWERK', 0.13115897448782873),
 ('D19_BANKEN_DIREKT', 0.14002398373351468),
 ('D19_REISEN', 0.14028639553849825),
 ('D19_VERSAND_REST', 0.1436881171825256),
 ('D19_TELKO_MOBILE', 0.14915692974582506),
 ('D19_KOSMETIK', 0.15506691098118397),
 ('D19_BANKEN_ONLINE_QUOTE_12', 0.16367011077749938),
 ('D19_KINDERARTIKEL', 0.16410160255401163),
 ('EINGEFUEGT_AM_month', 0.1697148764068355),
 ('D19_BEKLEIDUNG_REST', 0.18251616709779092),
 ('D19_HAUS_DEKO', 0.19306488857260717),
 ('MIN_GEBAEUDEJAHR', 0.26411019717757017),
 ('EINGEFUEGT_AM_year', 0.33713539767438505)]

level of recent activity

In [337]:
getWeight(pca, 3, gen_pop_low_nan_encoded, False)
Out[337]:
[('D19_BANKEN_ONLINE_QUOTE_12', -0.1360572543662393),
 ('D19_KOSMETIK', -0.12468960562247824),
 ('D19_REISEN', -0.1146501132233671),
 ('D19_HAUS_DEKO', -0.11348543263775318),
 ('D19_BEKLEIDUNG_REST', -0.11299415342441017),
 ('D19_TELKO_MOBILE', -0.10614623984260145),
 ('D19_HANDWERK', -0.09994121473242457),
 ('D19_BANKEN_DIREKT', -0.09795742995893847),
 ('D19_KINDERARTIKEL', -0.09325664758870172),
 ('D19_VERSAND_REST', -0.08744040749514585),
 ('D19_SAMMELARTIKEL', -0.08424261904170294),
 ('D19_DROGERIEARTIKEL', -0.0789570915954452),
 ('D19_TELKO_REST', -0.0765791763385359),
 ('D19_FREIZEIT', -0.07054663548973765),
 ('D19_RATGEBER', -0.06746193080413145),
 ('D19_VERSAND_ANZ_12', -0.06460556868490057),
 ('D19_BILDUNG', -0.06358583632808822),
 ('D19_BANKEN_GROSS', -0.06187478548087799),
 ('ANZ_PERSONEN', -0.06150098606879085),
 ('D19_GESAMT_ANZ_24', -0.06022736356688617),
 ('Movement', -0.05944472029332886),
 ('D19_WEIN_FEINKOST', -0.059433105327462),
 ('D19_BEKLEIDUNG_GEH', -0.05909356864967489),
 ('D19_BANKEN_REST', -0.057870708755937826),
 ('D19_SOZIALES', -0.057648005744639644),
 ('D19_SCHUHE', -0.05268500443314216),
 ('D19_LOTTO', -0.04937553720051925),
 ('D19_BIO_OEKO', -0.047213442762242694),
 ('D19_VERSAND_ANZ_24', -0.04717519392123564),
 ('D19_LEBENSMITTEL', -0.04683692983439739),
 ('D19_GESAMT_ANZ_12', -0.04402299536099004),
 ('D19_NAHRUNGSERGAENZUNG', -0.04268871390506046),
 ('VHA', -0.0403453268644092),
 ('D19_BANKEN_ANZ_24', -0.03825342522505369),
 ('D19_ENERGIE', -0.03742327178705912),
 ('D19_GARTEN', -0.034040691559847944),
 ('D19_TIERARTIKEL', -0.03287772357889804),
 ('KBA13_SEG_SPORTWAGEN', -0.032647121294907606),
 ('KBA13_SITZE_4', -0.03207597443026079),
 ('KBA05_HERST1', -0.030299000587136426),
 ('KBA13_MERCEDES', -0.029444199335072464),
 ('KBA13_HERST_BMW_BENZ', -0.029024801201911147),
 ('D19_VERSAND_ONLINE_QUOTE_12', -0.027937750776468293),
 ('D19_VERSICHERUNGEN', -0.02751638133464426),
 ('LP_FAMILIE_GROB', -0.026972628100929076),
 ('LP_FAMILIE_FEIN', -0.026800506359740656),
 ('KBA13_HALTER_65', -0.026296824713321),
 ('KBA13_SEG_OBEREMITTELKLASSE', -0.026160513185089652),
 ('KBA05_KRSKLEIN', -0.025808008833673297),
 ('ONLINE_AFFINITAET', -0.025545514185661165),
 ('D19_BANKEN_ANZ_12', -0.025310346479023985),
 ('D19_DIGIT_SERV', -0.02530391645507526),
 ('D19_BUCH_CD', -0.024364824611705978),
 ('D19_VOLLSORTIMENT', -0.022637096476495665),
 ('KBA05_SEG5', -0.021378851585340985),
 ('D19_TECHNIK', -0.021369514949805244),
 ('ALTER_HH', -0.021023187639808722),
 ('D19_SONSTIGE', -0.021013870034094224),
 ('LP_LEBENSPHASE_FEIN_stat', -0.02051438143514038),
 ('D19_VERSI_ANZ_24', -0.020429948324666266),
 ('WOHNDAUER_2008', -0.020362908450836145),
 ('KBA13_BMW', -0.0191360913931941),
 ('VHN', -0.018784605614463656),
 ('D19_GESAMT_ONLINE_QUOTE_12', -0.01813791794102991),
 ('LP_LEBENSPHASE_GROB_stat', -0.01778871324973858),
 ('KBA13_KMH_211', -0.017607155386287515),
 ('STRUKTURTYP', -0.01759073314806363),
 ('ORTSGR_KLS9', -0.017572257061862294),
 ('KBA13_KMH_250', -0.017314008910337474),
 ('KBA13_SEG_MINIWAGEN', -0.015702714780796774),
 ('KBA05_MOTOR', -0.014971242401485835),
 ('KBA13_ANTG2', -0.014888551103943298),
 ('KBA13_SEG_WOHNMOBILE', -0.014178154584883206),
 ('GREEN_AVANTGARDE', -0.014143070206827963),
 ('sum', -0.01396622556029579),
 ('KBA05_SEG8', -0.013680863763223777),
 ('KBA13_FIAT', -0.013594386553539058),
 ('GEBAEUDETYP', -0.013409852083515498),
 ('KKK', -0.012814123855562397),
 ('KBA13_ANTG3', -0.012665957103414402),
 ('D19_TELKO_ANZ_24', -0.012642321868650329),
 ('KBA13_SEG_OBERKLASSE', -0.01236621370963543),
 ('PLZ8_ANTG2', -0.012287217886913055),
 ('FINANZ_HAUSBAUER', -0.01225263602619973),
 ('KBA05_ALTER4', -0.012124881396854546),
 ('KBA05_MOD1', -0.01208156478822664),
 ('D19_VERSI_ANZ_12', -0.012076286083850208),
 ('D19_BANKEN_LOKAL', -0.01180555385222454),
 ('GEBURTSJAHR', -0.011733492989202943),
 ('KBA13_ALTERHALTER_61', -0.011468664048308782),
 ('KBA05_KW3', -0.011466980581466401),
 ('EWDICHTE', -0.011120985599724664),
 ('DSL_FLAG', -0.010763133583956683),
 ('KBA13_CCM_3000', -0.010667027943121389),
 ('KBA05_CCM4', -0.010250283311482655),
 ('KBA13_HALTER_66', -0.01001320595145045),
 ('SEMIO_LUST', -0.009993697476915721),
 ('LNR', -0.009970081524553647),
 ('KBA13_CCM_2501', -0.009508226812410583),
 ('REGIOTYP', -0.008826624130439647),
 ('KBA05_KRSAQUOT', -0.008680101191511407),
 ('KBA05_MOTRAD', -0.00833719440778088),
 ('KBA13_HHZ', -0.008030112439682303),
 ('PLZ8_ANTG3', -0.008004606766703918),
 ('VERDICHTUNGSRAUM', -0.007919901125953423),
 ('KBA05_ANTG1', -0.007916308407043113),
 ('KBA13_VORB_0', -0.0079025739292216),
 ('KBA13_KW_121', -0.007888852795413823),
 ('KBA05_MAXAH', -0.007830933994679086),
 ('RELAT_AB', -0.007620155973451552),
 ('KBA05_VORB0', -0.007434159414067246),
 ('MOBI_REGIO', -0.007224192057755141),
 ('SEMIO_SOZ', -0.007193926461541938),
 ('D19_TELKO_ANZ_12', -0.007069269542319745),
 ('VERS_TYP', -0.006829916291535412),
 ('KBA05_CCM3', -0.006792141196249034),
 ('PLZ8_HHZ', -0.006684154209633675),
 ('SEMIO_DOM', -0.006675615441163772),
 ('KBA13_KW_110', -0.006643134406067799),
 ('KBA13_ANTG4', -0.006642663254407505),
 ('KBA05_SEG10', -0.006489795604149966),
 ('KBA05_ALTER3', -0.006178059413774288),
 ('KBA13_BAUMAX', -0.00606898280427385),
 ('ANZ_HAUSHALTE_AKTIV', -0.005891653260615845),
 ('ALTERSKATEGORIE_FEIN', -0.005873993578183871),
 ('CJT_GESAMTTYP', -0.005597755351800396),
 ('med', -0.0054550931085140635),
 ('KBA05_ANHANG', -0.005363181411561227),
 ('KBA13_BJ_2008', -0.005347064083256076),
 ('KBA13_CCM_2500', -0.005310621054496862),
 ('LP_LEBENSPHASE_FEIN_wealth', -0.005249789293830181),
 ('KBA05_SEG6', -0.00520775329362445),
 ('KBA05_KRSHERST1', -0.0051842427253418095),
 ('KBA05_KRSOBER', -0.004970101297803656),
 ('ANZ_STATISTISCHE_HAUSHALTE', -0.004827034919139632),
 ('KBA05_AUTOQUOT', -0.004772986787024622),
 ('SEMIO_KULT', -0.004622673147441745),
 ('KBA05_HERST2', -0.004396528763918327),
 ('FINANZ_VORSORGER', -0.004342652113398382),
 ('KBA13_SITZE_6', -0.004279497805742186),
 ('FINANZTYP', -0.004262988235379337),
 ('KBA05_DIESEL', -0.00408593430717054),
 ('ANZ_KINDER', -0.004024145959627971),
 ('KONSUMZELLE', -0.003995198143593117),
 ('KBA05_ZUL4', -0.003938323888975254),
 ('KBA13_KRSHERST_BMW_BENZ', -0.003920554739626497),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT', -0.0038660077590375345),
 ('KBA05_MAXSEG', -0.0038554951826411644),
 ('skew', -0.0038177989081585473),
 ('KBA13_CCM_2000', -0.0038113662355296384),
 ('KBA05_MOD8', -0.0037137110556610072),
 ('PLZ8_ANTG4', -0.00369492073759949),
 ('KBA05_KRSZUL', -0.003687221768413195),
 ('KBA13_SEG_GELAENDEWAGEN', -0.003675803529549088),
 ('HH_DELTA_FLAG', -0.003524262337160412),
 ('KBA05_SEG7', -0.003465614251464848),
 ('KBA05_ZUL3', -0.003355393105345738),
 ('KBA05_SEG1', -0.002981537654806005),
 ('life_stage', -0.002919804826765564),
 ('KBA13_KRSZUL_NEU', -0.002809469729018373),
 ('KBA13_SEG_GROSSRAUMVANS', -0.0027985767632150785),
 ('kurt', -0.002604229861999246),
 ('KBA13_CCM_3001', -0.002581235923930028),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD', -0.0025096334497911187),
 ('KBA05_MOD4', -0.002411872251568685),
 ('KBA05_GBZ', -0.00234855514551521),
 ('KBA05_KRSHERST2', -0.002287923285768297),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN', -0.002275087129447913),
 ('KBA05_SEG9', -0.002257932605204981),
 ('KBA13_KMH_251', -0.002245353990318723),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE', -0.002240137211723571),
 ('LP_LEBENSPHASE_GROB_wealth', -0.002147981043722869),
 ('KBA05_FRAU', -0.0021049590392222027),
 ('D19_VERSI_ONLINE_QUOTE_12', -0.002049169896201042),
 ('SEMIO_PFLICHT', -0.0020460347694433965),
 ('SEMIO_KAEM', -0.0018414826969513065),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO', -0.0018113997578241981),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE', -0.001775086303927192),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL', -0.0017542923268153425),
 ('KBA13_KW_120', -0.0017146561816239433),
 ('HEALTH_TYP', -0.0016741457024406993),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT', -0.0016475062951739243),
 ('KBA13_HALTER_40', -0.0016226657657966302),
 ('D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST', -0.0015798199635223826),
 ('LP_STATUS_GROB', -0.001552733328320327),
 ('PLZ8_ANTG1', -0.00155222321361103),
 ('KBA05_MAXBJ', -0.001457984698799958),
 ('CAMEO_DEU_2015_2D', -0.0013801005087792132),
 ('CAMEO_DEUG_2015_adj', -0.0013634179381406754),
 ('KBA13_SEG_UTILITIES', -0.0010952031399241915),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS', -0.0010929115766395816),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST', -0.0010872802171830592),
 ('CAMEO_DEU_2015_1D', -0.0009663852493387639),
 ('CAMEO_DEU_2015_8A', -0.0009553218062629779),
 ('CAMEO_DEU_2015_6B', -0.0009436093036978722),
 ('ANZ_HH_TITEL', -0.0009294191558863813),
 ('UMFELD_JUNG', -0.0008704271614232906),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH', -0.0008653777403647559),
 ('FINANZ_MINIMALIST', -0.0008642145209480331),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE', -0.0008416311316222065),
 ('KBA05_KRSVAN', -0.00082930480825159),
 ('KBA05_MOD2', -0.0008121826808558579),
 ('HH_EINKOMMEN_SCORE', -0.0008027316094391837),
 ('CAMEO_DEU_2015_5D', -0.0007843040285311355),
 ('KBA13_AUDI', -0.0007693178972406135),
 ('KBA05_VORB1', -0.0007433841632972052),
 ('KBA13_KRSSEG_OBER', -0.0007196367501666464),
 ('KBA13_HALTER_45', -0.000711467724834275),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST', -0.000661931645337235),
 ('D19_TELKO_ONLINE_QUOTE_12', -0.0006486844582071072),
 ('CAMEO_DEU_2015_8B', -0.000629846301948596),
 ('KBA13_PEUGEOT', -0.0005990996577600003),
 ('CAMEO_DEU_2015_3C', -0.0005890716081151566),
 ('KBA05_ZUL1', -0.0005780220017510941),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL', -0.0005399132799565302),
 ('KBA05_ANTG2', -0.0005333866412165785),
 ('D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE', -0.000527242002696821),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL', -0.0004312453980171698),
 ('CAMEO_DEU_2015_1A', -0.0003976080386172134),
 ('CAMEO_DEU_2015_1E', -0.00039241693679972496),
 ('CAMEO_DEU_2015_9D', -0.0003805862405198943),
 ('CAMEO_DEU_2015_6C', -0.0003690855188808431),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST', -0.0003478945701996092),
 ('CAMEO_DEU_2015_9A', -0.0003252163977901342),
 ('D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT', -0.0003242378097616384),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK', -0.00032373974864386725),
 ('KBA05_KW2', -0.00031680882230943834),
 ('KBA13_SEG_VAN', -0.0003110797469844788),
 ('D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG', -0.00029941214692528605),
 ('CAMEO_DEU_2015_4D', -0.0002844222823604992),
 ('KBA13_BJ_2009', -0.0002654522082010773),
 ('D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER', -0.00024671916127290183),
 ('KBA13_VORB_1', -0.00024340223105447317),
 ('CAMEO_DEU_2015_6E', -0.00024179138378243247),
 ('D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV', -0.0002389968918349544),
 ('CAMEO_DEU_2015_9C', -0.00022117395275989525),
 ('CAMEO_DEU_2015_8C', -0.00022092018248261704),
 ('KBA13_BJ_2006', -0.00020379572676683952),
 ('CAMEO_DEU_2015_1C', -0.0002021758838888528),
 ('D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL', -0.00019805676600990446),
 ('UNGLEICHENN_FLAG', -0.00018854209578901356),
 ('D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST', -0.00018630080503322913),
 ('CAMEO_DEU_2015_3D', -0.00018164032367720502),
 ('ANZ_TITEL', -0.000174695868962453),
 ('CAMEO_DEU_2015_1B', -0.000174323503538025),
 ('CAMEO_DEU_2015_4E', -0.00016079677255973992),
 ('CAMEO_DEU_2015_5F', -0.0001581691854913574),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL', -0.0001531719665363162),
 ('D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL', -0.00014508723362047113),
 ('ANREDE_KZ', -0.00014413365856638194),
 ('CAMEO_DEU_2015_7C', -0.00013716101963726294),
 ('D19_LETZTER_KAUF_BRANCHE_D19_REISEN', -0.00013390763058474843),
 ('CAMEO_DEU_2015_7D', -0.00012607052100555325),
 ('SEMIO_REL', -0.00011879281274191458),
 ('D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK', -0.00011082506104012018),
 ('CAMEO_DEU_2015_5E', -0.00010317897266798685),
 ('CAMEO_DEU_2015_6D', -9.670143020590174e-05),
 ('SOHO_KZ', -9.433593306125686e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO', -8.831794404124416e-05),
 ('PLZ8_BAUMAX_adj', -8.117203186954151e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_GARTEN', -7.743633108044147e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG', -6.987516848159583e-05),
 ('TITEL_KZ', -6.702251092891525e-05),
 ('SEMIO_ERL', -6.438072008178718e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK', -6.21754574846553e-05),
 ('CAMEO_DEU_2015_5B', -5.331279802523244e-05),
 ('KBA13_ALTERHALTER_45', -4.433133355890086e-05),
 ('D19_LETZTER_KAUF_BRANCHE_D19_LOTTO', -2.9822205115819905e-05),
 ('NATIONALITAET_KZ', -2.1453660901613096e-05),
 ('min', 0.0),
 ('KBA13_HERST_EUROPA', 1.7317456212733783e-05),
 ('CAMEO_DEU_2015_7B', 1.8359454255383666e-05),
 ('CAMEO_DEU_2015_7E', 2.8819281223065722e-05),
 ('CAMEO_DEU_2015_6F', 3.096894310148686e-05),
 ('CAMEO_DEU_2015_6A', 3.369143435541201e-05),
 ('KBA05_SEG4', 3.538185854131872e-05),
 ('CAMEO_DEU_2015_2C', 4.992415616909724e-05),
 ('CAMEO_DEU_2015_4C', 0.00011183861628918275),
 ('LP_STATUS_FEIN', 0.0001154348113489001),
 ('CAMEO_DEU_2015_7A', 0.0001526995384238623),
 ('CAMEO_DEU_2015_4B', 0.00016445620556429063),
 ('CAMEO_DEU_2015_8D', 0.0001766841488481503),
 ('CJT_TYP_3', 0.00021105064157313864),
 ('FINANZ_SPARER', 0.00025681321390224785),
 ('KBA13_KRSHERST_AUDI_VW', 0.0002699495381152764),
 ('CAMEO_DEU_2015_9E', 0.0002926165926480572),
 ('ARBEIT', 0.0002985242142995652),
 ('KBA13_KRSSEG_KLEIN', 0.00030045960361453736),
 ('KBA05_BAUMAX', 0.000328126701660154),
 ('KBA13_ANTG1', 0.00032917725693000434),
 ('Decade', 0.000346661382544774),
 ('KBA13_HERST_AUDI_VW', 0.00034924855499608433),
 ('CAMEO_DEU_2015_5C', 0.00036532775534468863),
 ('KBA13_KW_90', 0.0003764429799148489),
 ('CAMEO_DEU_2015_9B', 0.00041249217546892675),
 ('ALTERSKATEGORIE_GROB', 0.00041547281412730085),
 ('CAMEO_DEU_2015_2B', 0.00047479122701430873),
 ('KBA13_VW', 0.0005037178219467834),
 ('KBA05_MOD3', 0.0005075735774097188),
 ('KBA05_ZUL2', 0.0005165933709232166),
 ('CAMEO_DEU_2015_4A', 0.0005615466768760118),
 ('KBA13_BJ_2004', 0.0005878013822266194),
 ('PLZ8_GBZ', 0.0006230320851061155),
 ('KBA13_SEG_SONSTIGE', 0.0007397928467103698),
 ('KBA13_KMH_210', 0.0007993938190055684),
 ('KBA13_HALTER_35', 0.0008161276834676507),
 ('CAMEO_DEU_2015_3B', 0.0009805951667032538),
 ('CAMEO_DEU_2015_2A', 0.0009846954892691738),
 ('CJT_TYP_1', 0.001020687265916386),
 ('KBA05_SEG2', 0.0010516749645309235),
 ('KBA13_MOTOR', 0.0010669202821937424),
 ('KBA13_CCM_1800', 0.001080076536268068),
 ('KBA05_KRSHERST3', 0.0011265078350340507),
 ('WOHNLAGE_adj', 0.0012149397610570103),
 ('SEMIO_TRADV', 0.0014134935698488782),
 ('KBA05_HERST3', 0.0014875593482144601),
 ('KBA13_KW_61_120', 0.0015276242385700578),
 ('CAMEO_DEU_2015_5A', 0.001555035363647475),
 ('KBA13_VORB_2', 0.0016083401801781018),
 ('wealth', 0.0016172391512443533),
 ('KBA13_ANZAHL_PKW', 0.001689943319331369),
 ('GFK_URLAUBERTYP', 0.0016945905452013137),
 ('D19_TELKO_ONLINE_DATUM', 0.0016960570509593296),
 ('KBA13_KRSSEG_VAN', 0.0018423353768457981),
 ('KBA13_KMH_140', 0.0019473129458609994),
 ('KBA13_BJ_2000', 0.0019793391475207223),
 ('KBA13_BJ_1999', 0.0019912975353289252),
 ('KBA13_KRSHERST_FORD_OPEL', 0.002128242204437878),
 ('KBA13_CCM_0_1400', 0.002323231758634776),
 ('KBA13_KW_60', 0.0023283830031453237),
 ('KOMBIALTER', 0.002370357180074882),
 ('KBA13_SEG_MINIVANS', 0.002377689365428893),
 ('SEMIO_RAT', 0.0023988385004498967),
 ('KBA13_SEG_KLEINST', 0.002501519367136103),
 ('KBA05_ANTG4', 0.0026530105846132956),
 ('CJT_TYP_6', 0.0026941536259823323),
 ('KBA13_KRSAQUOT', 0.0027476433457834567),
 ('KBA13_VORB_3', 0.002749135896258886),
 ('KBA13_MAZDA', 0.002879324927112154),
 ('KBA13_TOYOTA', 0.002900846997871024),
 ('D19_VERSI_ONLINE_DATUM', 0.0029364970332151404),
 ('KBA13_KW_50', 0.0029881152316524767),
 ('SEMIO_VERT', 0.0030093927957294585),
 ('KBA13_KW_0_60', 0.003052744259839958),
 ('KBA13_KMH_0_140', 0.003058602859695357),
 ('KBA13_CCM_1000', 0.00307228087827824),
 ('KBA13_VORB_1_2', 0.0031003231678365176),
 ('LP_LEBENSPHASE_FEIN_age', 0.0031009562647282495),
 ('KBA13_CCM_1400', 0.0031160939606148484),
 ('KBA13_CCM_1600', 0.003176674175293427),
 ('SEMIO_FAM', 0.0032029935548459638),
 ('RT_UEBERGROESSE', 0.0032326176341355194),
 ('KBA13_SEG_MITTELKLASSE', 0.003344833663164587),
 ('CAMEO_DEU_2015_3A', 0.003488380246724438),
 ('KBA13_FORD', 0.003574865114528242),
 ('KBA13_KW_40', 0.0036071913921740008),
 ('KBA05_CCM2', 0.0037254469848499562),
 ('KBA05_CCM1', 0.0037290105103529926),
 ('KBA05_KW1', 0.0037450601500614877),
 ('KBA13_OPEL', 0.003747153104545704),
 ('SHOPPER_TYP', 0.0037574739146581456),
 ('KBA13_GBZ', 0.0038092951852269144),
 ('KBA13_KMH_180', 0.003865471219111404),
 ('KBA13_CCM_1200', 0.003941595072363085),
 ('KBA13_RENAULT', 0.0040252727390845955),
 ('KBA13_HERST_FORD_OPEL', 0.004132177942760394),
 ('CJT_TYP_4', 0.004188900147773097),
 ('CJT_TYP_2', 0.004203924881346133),
 ('KBA05_VORB2', 0.004218923423789527),
 ('KBA05_HERST4', 0.004224358015331877),
 ('KBA13_KW_80', 0.004399814839395989),
 ('KBA13_HALTER_30', 0.004510977967246021),
 ('KBA13_NISSAN', 0.0045454695595311114),
 ('KBA13_FAB_SONSTIGE', 0.004734526208668013),
 ('KBA13_HERST_SONST', 0.004734526208668013),
 ('KBA05_MAXVORB', 0.004912707864784536),
 ('KBA13_CCM_1500', 0.005083577197728765),
 ('RT_KEIN_ANREIZ', 0.005200237613096273),
 ('FINANZ_ANLEGER', 0.005301374649980892),
 ('RETOURTYP_BK_S', 0.005455439886107851),
 ('KBA13_KMH_110', 0.005595820995546436),
 ('KBA13_HALTER_60', 0.00565021156418963),
 ('KBA13_KW_70', 0.005800112003892735),
 ('UMFELD_ALT', 0.006070699484465619),
 ('KBA05_HERST5', 0.006102792271514381),
 ('FINANZ_UNAUFFAELLIGER', 0.006119241816539557),
 ('KBA13_KW_30', 0.006250821655998154),
 ('KBA05_SEG3', 0.0063359853127677845),
 ('KBA05_ALTER2', 0.006586212258550472),
 ('CJT_TYP_5', 0.006883619143004804),
 ('KBA13_CCM_1401_2500', 0.006959390746388643),
 ('KBA13_SEG_KLEINWAGEN', 0.006972492179420352),
 ('GEBAEUDETYP_RASTER', 0.00747189340983367),
 ('std', 0.007561478892710939),
 ('mean', 0.0076734018479684),
 ('KBA05_ALTER1', 0.007980546683330944),
 ('W_KEIT_KIND_HH', 0.008223779813261724),
 ('KBA13_SEG_KOMPAKTKLASSE', 0.00852202107847543),
 ('KBA13_ALTERHALTER_30', 0.008541230936694066),
 ('KBA13_FAB_ASIEN', 0.009355078655176355),
 ('KBA13_HERST_ASIEN', 0.009399565188320724),
 ('KBA05_ANTG3', 0.009405323125031157),
 ('SEMIO_KRIT', 0.00998404542991313),
 ('SEMIO_MAT', 0.010241533692722326),
 ('BALLRAUM', 0.010549816764204092),
 ('OST_WEST_KZ', 0.010580335637619691),
 ('INNENSTADT', 0.010621715252990323),
 ('MOBI_RASTER', 0.01069729927276762),
 ('D19_VERSI_OFFLINE_DATUM', 0.010786120653586544),
 ('KBA13_HALTER_25', 0.011438176547538322),
 ('D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT', 0.011868409342715401),
 ('KBA13_KMH_140_210', 0.01326875592798177),
 ('RT_SCHNAEPPCHEN', 0.01422124323027177),
 ('AKT_DAT_KL', 0.014571015136613328),
 ('KONSUMNAEHE', 0.015002493343485855),
 ('GEMEINDETYP', 0.015121635636782038),
 ('CJT_KATALOGNUTZER', 0.01538623075572024),
 ('D19_BANKEN_OFFLINE_DATUM', 0.015401428111378707),
 ('KBA13_HALTER_55', 0.016053057312189964),
 ('KBA13_AUTOQUOTE', 0.016425116126878844),
 ('KBA05_MAXHERST', 0.016845101496294587),
 ('FIRMENDICHTE', 0.017097626400968464),
 ('EINGEZOGENAM_HH_JAHR', 0.01786709533984085),
 ('KBA13_HALTER_20', 0.018715116506489313),
 ('KBA13_ALTERHALTER_60', 0.01965381711014668),
 ('KBA13_HALTER_50', 0.02018190728209536),
 ('max', 0.022753067077421776),
 ('D19_TELKO_OFFLINE_DATUM', 0.023929974875610854),
 ('D19_KONSUMTYP', 0.02479773423154472),
 ('ZABEOTYP', 0.025755128255244484),
 ('KBA13_SITZE_5', 0.029021151288683635),
 ('D19_KONSUMTYP_MAX', 0.03116358354452083),
 ('D19_VERSAND_ONLINE_DATUM', 0.03316975280490815),
 ('VK_DHT4A', 0.03382843503757589),
 ('D19_GESAMT_DATUM', 0.03466414536356653),
 ('KK_KUNDENTYP_mod', 0.03585240870583431),
 ('D19_VERSAND_DATUM', 0.036271860272764606),
 ('VK_ZG11', 0.036569366177832595),
 ('D19_GESAMT_ONLINE_DATUM', 0.03778569840734744),
 ('VK_DISTANZ', 0.03947984815892236),
 ('KBA05_MODTEMP', 0.041272080042777146),
 ('EINGEFUEGT_AM_weekday', 0.0431424196948172),
 ('D19_TELKO_DATUM', 0.06074083468525588),
 ('D19_VERSI_DATUM', 0.07312822195952352),
 ('D19_VERSAND_OFFLINE_DATUM', 0.0793061176299923),
 ('D19_BANKEN_ONLINE_DATUM', 0.09101616223455115),
 ('KBA05_HERSTTEMP', 0.09620127741592065),
 ('D19_GESAMT_OFFLINE_DATUM', 0.10371006485083259),
 ('D19_BANKEN_DATUM', 0.10988183638657545),
 ('EINGEFUEGT_AM_day', 0.1626248116917948),
 ('EINGEFUEGT_AM_month', 0.3075544947618267),
 ('MIN_GEBAEUDEJAHR', 0.45804948445661964),
 ('EINGEFUEGT_AM_year', 0.596026104214265)]

offline activity

In [ ]:
score = []
for i in range(2,50):
    kmean_init = KMeans(n_clusters=i, random_state=42)
    model = kmean_init.fit(X_pca)
    score.append(abs(model.score(X_pca)))
    print("\nDone with i:",i,"The score is:",score[i-2])
    
plt.plot(list(range(2,50)), score)
plt.xlabel('K')
plt.ylabel('SSE')
Done with i: 2 The score is: 322575926.1651047

Done with i: 3 The score is: 313364868.9455074

Done with i: 4 The score is: 294083124.38877946

Done with i: 5 The score is: 289608968.5943805

Done with i: 6 The score is: 285799198.03568727

Done with i: 7 The score is: 282779062.51599026

Done with i: 8 The score is: 280271214.7924869

Done with i: 9 The score is: 278013506.2584997

Done with i: 10 The score is: 276143756.39900583

Done with i: 11 The score is: 274601071.3823531

Done with i: 12 The score is: 273162607.5241778

Done with i: 13 The score is: 271971142.77329594

Done with i: 14 The score is: 270958796.31468433

Done with i: 15 The score is: 270032405.01610416

Done with i: 16 The score is: 269158948.1825118

Done with i: 17 The score is: 267972887.09040833

Done with i: 18 The score is: 267136341.61606655

Done with i: 19 The score is: 266698935.31224236

Done with i: 20 The score is: 266108016.88404578

Done with i: 21 The score is: 265039423.27700454

Done with i: 22 The score is: 264688265.60141635

Done with i: 23 The score is: 264051640.00621733

Done with i: 24 The score is: 262881819.14802387

Done with i: 25 The score is: 262337590.6471799

Done with i: 26 The score is: 262184419.8864755

Done with i: 27 The score is: 261115804.98523906

Done with i: 28 The score is: 260998236.72914648

Done with i: 29 The score is: 260057365.04721287

Done with i: 30 The score is: 259490030.89395157

Done with i: 31 The score is: 258799883.74145913

Done with i: 32 The score is: 258565547.29967064

Done with i: 33 The score is: 258335945.05115688
In [2]:
t= """Done with i: 2 The score is: 322575926.1651047

Done with i: 3 The score is: 313364868.9455074

Done with i: 4 The score is: 294083124.38877946

Done with i: 5 The score is: 289608968.5943805

Done with i: 6 The score is: 285799198.03568727

Done with i: 7 The score is: 282779062.51599026

Done with i: 8 The score is: 280271214.7924869

Done with i: 9 The score is: 278013506.2584997

Done with i: 10 The score is: 276143756.39900583

Done with i: 11 The score is: 274601071.3823531

Done with i: 12 The score is: 273162607.5241778

Done with i: 13 The score is: 271971142.77329594

Done with i: 14 The score is: 270958796.31468433

Done with i: 15 The score is: 270032405.01610416

Done with i: 16 The score is: 269158948.1825118

Done with i: 17 The score is: 267972887.09040833

Done with i: 18 The score is: 267136341.61606655

Done with i: 19 The score is: 266698935.31224236

Done with i: 20 The score is: 266108016.88404578

Done with i: 21 The score is: 265039423.27700454

Done with i: 22 The score is: 264688265.60141635

Done with i: 23 The score is: 264051640.00621733

Done with i: 24 The score is: 262881819.14802387

Done with i: 25 The score is: 262337590.6471799

Done with i: 26 The score is: 262184419.8864755

Done with i: 27 The score is: 261115804.98523906

Done with i: 28 The score is: 260998236.72914648

Done with i: 29 The score is: 260057365.04721287

Done with i: 30 The score is: 259490030.89395157

Done with i: 31 The score is: 258799883.74145913

Done with i: 32 The score is: 258565547.29967064

Done with i: 33 The score is: 258335945.05115688"""
In [6]:
score = re.findall("\d+\.\d+", t)
In [26]:
plt.plot(list(range(2,34)), [float(val) for val in score])
plt.xlabel('K')
plt.ylabel('SSE')
Out[26]:
Text(0, 0.5, 'SSE')
In [17]:
k = 10
In [18]:
kmeans = KMeans(n_clusters= k, random_state=42)
final_model = kmeans.fit(X_pca)
In [19]:
gen_population_predict = final_model.predict(X_pca)
In [206]:
def clean_data(df, no_del=True):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame, no_del: flag to cancel row elimination
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    # Put in code here to execute all main cleaning steps:
    # Get rid of Unnamed 0 col
    if 'Unnamed: 0' in df.columns:
        df.drop(['Unnamed: 0'], axis=1, inplace= True)
    
    # convert missing value codes into NaNs, ...
    print("initial", df.shape)
    _, df = findNans(df, missing_values)
    print("after findNans", df.shape)
    
    # extrapolate high NaN but high correlation features
    regFillApply(df, 'KK_KUNDENTYP', 'D19_GESAMT_DATUM')
    
    # remove selected columns and rows, ...
    to_be_removed = ['AGER_TYP', 'ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4',
       'EXTSEL992', 'KK_KUNDENTYP', 'KK_KUNDENTYP_corr']
    
    df.drop(to_be_removed, axis=1, inplace=True)
    
    if no_del:
        df['percent_NaN'] = df.isnull().mean(axis=1) 
        df = df.loc[df['percent_NaN'] <= 0.3, :]
        print("after removing some rows", df.shape)
        df.drop(['percent_NaN'], axis=1, inplace=True)
        print("after removing percent_NaN", df.shape)

    
    # select, re-encode, and engineer column values.
    try:
        df.OST_WEST_KZ.replace({"W": 0, "O": 1}, inplace=True)
    except:
        pass
    
    print("after OST_WEST_KZ", df.shape)
    
    df['EINGEFUEGT_AM'] = pd.to_datetime(df['EINGEFUEGT_AM'])
    df['EINGEFUEGT_AM_weekday'] = df['EINGEFUEGT_AM'].apply(lambda x: x.weekday())
    df['EINGEFUEGT_AM_day'] = df['EINGEFUEGT_AM'].apply(lambda x: x.day)
    df['EINGEFUEGT_AM_month'] = df['EINGEFUEGT_AM'].apply(lambda x: x.month)
    df['EINGEFUEGT_AM_year'] = df['EINGEFUEGT_AM'].apply(lambda x: x.year)
    
    print("after engineering EINGEFUEGT Features")
    
    df['CAMEO_INTL_2015_adj'] = df['CAMEO_INTL_2015'].apply(lambda x: int(x) if x == x else np.nan)
    if 'CAMEO_INTL_2015_adj' in df.columns:    
        df["wealth"] = getWealth(df["CAMEO_INTL_2015"])
        df["life_stage"] = getLifeStage(df['CAMEO_INTL_2015'])  
    
    print("after engineering wealth, life_stage")
    
    df['CAMEO_DEUG_2015_adj'] = df['CAMEO_DEUG_2015'].apply(lambda x: int(x) if x == x else np.nan)

    if 'PRAEGENDE_JUGENDJAHRE' in df.columns:
        df['Movement'] = getMovement(df['PRAEGENDE_JUGENDJAHRE'])
        df['Decade']= getDecade(df['PRAEGENDE_JUGENDJAHRE'])
        
    print("after engineering Movement and Decade")
    
    df['LP_LEBENSPHASE_FEIN_wealth'], df['LP_LEBENSPHASE_FEIN_stat'], df['LP_LEBENSPHASE_FEIN_age'] = wealth_soc_stat_age(df.LP_LEBENSPHASE_FEIN)
    df['LP_LEBENSPHASE_GROB_wealth'], df['LP_LEBENSPHASE_GROB_stat'] = wealth_soc_stat_LP_LEBENSPHASE_GROB(df.LP_LEBENSPHASE_GROB)    
    
    print("after engineering FEIN_wealth,stat,age and GROB_wealth,stat")
    
    df['WOHNLAGE_adj'] = df.WOHNLAGE.apply(lambda x: x if x < 7 else np.nan)
    df['PLZ8_BAUMAX_adj'] = df.PLZ8_BAUMAX.apply(lambda x: x if x < 5 else np.nan)
    
    df.drop(['EINGEFUEGT_AM','CAMEO_INTL_2015', 'CAMEO_INTL_2015_adj','CAMEO_DEUG_2015', 'PRAEGENDE_JUGENDJAHRE','LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB','WOHNLAGE','PLZ8_BAUMAX'], axis= 1, inplace= True)

        
                
    print("before hot encoding", df.shape)
    df = pd.get_dummies(df, columns = ['D19_LETZTER_KAUF_BRANCHE','CAMEO_DEU_2015'])
    print("after hot encoding", df.shape)
    
    idx = df.columns.values[1:]
    df['sum'] = df[idx].sum(axis=1)  
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)
    
    print("after engineering sum, min, max, etc features")

    return df
In [21]:
customers_processed = clean_data(customers)
initial (191652, 369)
after findNans (191652, 369)
r2= 0.6909194026562073
MSE= 0.8150410838612557
after removing some rows (140371, 364)
after removing percent_NaN (140371, 363)
after OST_WEST_KZ (140371, 363)
after engineering EINGEFUEGT Features
after engineering wealth, life_stage
after engineering Movement and Decade
after engineering FEIN_wealth,stat,age and GROB_wealth,stat
before hot encoding (140371, 371)
after hot encoding (140371, 448)
after engineering sum, min, max, etc features
In [22]:
customers_processed.drop(['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], axis= 1, inplace= True)
In [143]:
customers_imp = imp.transform(customers_processed)
customers_scaled = scaler.transform(customers_imp)
customers_pca = pca.transform(customers_scaled)
customers_clusters = final_model.predict(customers_pca)
In [24]:
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
prop_gen_pop = []
prop_cust_pop = []

counts = np.unique(customers_clusters, return_counts=True)[1]
for i in range(10):
    prop_cust_pop.append(counts[i]/len(customers_clusters))
    
counts = np.unique(gen_population_predict, return_counts=True)[1]
for i in range(10):
    prop_gen_pop.append(counts[i]/len(gen_population_predict))
    
temp_df = {'cluster': list(range(10)), 'prop_cust': prop_cust_pop, 'prop_gen': prop_gen_pop}    
prop_df = pd.DataFrame(temp_df)


prop_df.plot(x='cluster', y = ['prop_cust', 'prop_gen'], kind='bar')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22765e80>
In [25]:
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
over_rep_inv_pca = pca.inverse_transform(customers_pca[np.where(customers_clusters==0)])
over_rep = scaler.inverse_transform(over_rep_inv_pca).round()
over_rep_df = pd.DataFrame(data= over_rep, columns=customers_processed.columns)
over_rep_df = over_rep_df.applymap(lambda x: x if x>0 else 0)

over_rep_df.mean().sort_values()

gen_pop_clust_inv_pca = pca.inverse_transform(X_pca[np.where(gen_population_predict!=0)])
gen_pop_clust = scaler.inverse_transform(gen_pop_clust_inv_pca).round()
gen_pop_clust_df = pd.DataFrame(data= gen_pop_clust, columns=customers_processed.columns)
gen_pop_clust_df = gen_pop_clust_df.applymap(lambda x: x if x>0 else 0)
In [31]:
sort_diff = (over_rep_df.mean() - gen_pop_clust_df.mean()).sort_values()
In [32]:
sort_diff
Out[32]:
LNR                           -27167.570079
max                               -7.961723
ANZ_HAUSHALTE_AKTIV               -6.369678
EINGEZOGENAM_HH_JAHR              -6.101864
ANZ_STATISTISCHE_HAUSHALTE        -5.721677
kurt                              -5.118226
VK_DISTANZ                        -4.881716
VK_ZG11                           -4.413949
D19_VERSAND_DATUM                 -3.793515
D19_KONSUMTYP_MAX                 -3.764235
D19_GESAMT_DATUM                  -3.615585
VK_DHT4A                          -3.429383
AKT_DAT_KL                        -3.414888
KK_KUNDENTYP_mod                  -3.107894
D19_VERSAND_ONLINE_DATUM          -3.018667
CAMEO_DEU_2015_9D                 -2.906263
D19_GESAMT_ONLINE_DATUM           -2.849660
D19_GESAMT_OFFLINE_DATUM          -2.815990
D19_KONSUMTYP                     -2.695517
ALTERSKATEGORIE_FEIN              -2.603328
D19_VERSAND_OFFLINE_DATUM         -2.550996
CAMEO_DEUG_2015_adj               -2.066631
GFK_URLAUBERTYP                   -1.681232
CAMEO_DEU_2015_9E                 -1.564572
ZABEOTYP                          -1.486317
FINANZ_SPARER                     -1.284967
HH_EINKOMMEN_SCORE                -1.207608
KBA13_BAUMAX                      -1.201115
CJT_TYP_2                         -1.183542
CJT_TYP_1                         -1.162313
                                   ...     
D19_FREIZEIT                       1.345376
D19_GESAMT_ONLINE_QUOTE_12         1.402797
CJT_TYP_3                          1.482374
LP_FAMILIE_GROB                    1.677648
D19_WEIN_FEINKOST                  1.795062
D19_VERSAND_ANZ_24                 1.859477
D19_VERSAND_ONLINE_QUOTE_12        1.887179
D19_BEKLEIDUNG_GEH                 1.924551
D19_GESAMT_ANZ_24                  1.935737
D19_RATGEBER                       1.947087
LP_STATUS_FEIN                     2.027982
D19_KINDERARTIKEL                  2.092883
D19_BILDUNG                        2.284275
D19_LOTTO                          2.427797
D19_BEKLEIDUNG_REST                2.509036
D19_VOLLSORTIMENT                  2.837767
std                                2.871952
D19_HANDWERK                       2.877037
D19_HAUS_DEKO                      2.877448
D19_SAMMELARTIKEL                  2.878576
D19_TECHNIK                        2.890104
D19_BUCH_CD                        3.056259
D19_KOSMETIK                       3.176849
D19_REISEN                         3.236563
D19_SONSTIGE                       3.291989
LP_FAMILIE_FEIN                    3.712703
GEMEINDETYP                        5.326269
KBA13_ANZAHL_PKW                  65.975959
GEBURTSJAHR                      293.875728
sum                              414.274301
Length: 453, dtype: float64
In [86]:
ind = np.abs(sort_diff).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar',figsize=(12,6))
Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a444b82b0>
In [35]:
over_rep_df.ANZ_HAUSHALTE_AKTIV.mean()
Out[35]:
2.518459774041629
In [36]:
gen_pop_clust_df.ANZ_HAUSHALTE_AKTIV.mean()
Out[36]:
8.888137449189852
In [37]:
gen_pop_clust_df.ANZ_HAUSHALTE_AKTIV.mean()-over_rep_df.ANZ_HAUSHALTE_AKTIV.mean()
Out[37]:
6.369677675148223
In [74]:
new_df = pd.concat([gen_pop_clust_df.mean(), over_rep_df.mean()], axis= 1)
In [75]:
new_df_2 = new_df.applymap(lambda x: x if x != 0 else x+0.001)
In [64]:
(0.003489-0.189585)/0.189585
Out[64]:
-0.9815966453042172
In [63]:
new_df.loc['CAMEO_DEU_2015_9B',:]
Out[63]:
0    0.189585
1    0.003489
Name: CAMEO_DEU_2015_9B, dtype: float64
In [76]:
sort_diff2 = new_df.pct_change(axis='columns')[1].sort_values()
In [84]:
ind = np.abs(sort_diff2).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar', figsize=(12,6))
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a44124e48>
In [87]:
sort_diff3 = new_df_2.pct_change(axis='columns')[1].sort_values()
In [88]:
ind = np.abs(sort_diff3).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar', figsize=(12,6))
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a44749780>

Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [91]:
# mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
mailout_train = pd.read_csv('df_train.csv')
//anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning: Columns (19,20) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [93]:
if 'Unnamed: 0' in mailout_train.columns:
    mailout_train.drop(['Unnamed: 0'], axis=1, inplace= True)
In [92]:
mailout_train.head()
Out[92]:
Unnamed: 0 LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ... VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP RESPONSE ANREDE_KZ ALTERSKATEGORIE_GROB
0 0 1763 2 1.0 8.0 NaN NaN NaN NaN 8.0 ... 5.0 2.0 1.0 6.0 9.0 3.0 3 0 2 4
1 1 1771 1 4.0 13.0 NaN NaN NaN NaN 13.0 ... 1.0 2.0 1.0 4.0 9.0 7.0 1 0 2 3
2 2 1776 1 1.0 9.0 NaN NaN NaN NaN 7.0 ... 6.0 4.0 2.0 NaN 9.0 2.0 3 0 1 4
3 3 1460 2 1.0 6.0 NaN NaN NaN NaN 6.0 ... 8.0 11.0 11.0 6.0 9.0 1.0 3 0 2 4
4 4 1783 2 1.0 9.0 NaN NaN NaN NaN 9.0 ... 2.0 2.0 1.0 6.0 9.0 3.0 3 0 1 3

5 rows × 368 columns

In [106]:
mailout_train_clean = clean_data(mailout_train)
initial (42962, 367)
after findNans (42962, 367)
r2= 0.703814760808833
MSE= 0.8065283917035022
after removing some rows (35000, 362)
after removing percent_NaN (35000, 361)
after OST_WEST_KZ (35000, 361)
after engineering EINGEFUEGT Features
after engineering wealth, life_stage
after engineering Movement and Decade
after engineering FEIN_wealth,stat,age and GROB_wealth,stat
before hot encoding (35000, 369)
after hot encoding (35000, 446)
after engineering sum, min, max, etc features
In [115]:
features = [c for c in mailout_train_clean.columns if c not in ['RESPONSE']]
response = mailout_train_clean['RESPONSE']
In [251]:
training_imp = imp.transform(mailout_train_clean[features])
training_scaled = scaler.transform(training_imp)
training_scaled_df = pd.DataFrame(data= training_scaled, columns=mailout_train_clean[features].columns)
training_pca = pca.transform(training_scaled)
training_clusters = final_model.predict(training_pca)
In [164]:
mailout_train_clean['cluster'] = training_clusters
In [253]:
training_scaled_df['cluster'] = training_clusters

Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

In [200]:
# mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
mailout_test = pd.read_csv('df_test.csv')
//anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning: Columns (19,20) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [201]:
if 'Unnamed: 0' in mailout_test.columns:
    mailout_test.drop(['Unnamed: 0'], axis=1, inplace= True)
In [207]:
mailout_test_clean = clean_data(mailout_test, no_del=False)
initial (42833, 366)
after findNans (42833, 366)
r2= 0.7111432304253538
MSE= 0.7785144398246994
after OST_WEST_KZ (42833, 360)
after engineering EINGEFUEGT Features
after engineering wealth, life_stage
after engineering Movement and Decade
after engineering FEIN_wealth,stat,age and GROB_wealth,stat
before hot encoding (42833, 368)
after hot encoding (42833, 445)
after engineering sum, min, max, etc features
In [249]:
testing_imp = imp.transform(mailout_test_clean[features])
testing_scaled = scaler.transform(testing_imp)
testing_scaled_df = pd.DataFrame(data= testing_scaled, columns=mailout_test_clean[features].columns)
testing_pca = pca.transform(testing_scaled)
testing_clusters = final_model.predict(testing_pca)
In [209]:
mailout_test_clean['cluster'] = testing_clusters
In [254]:
testing_scaled_df['cluster'] = testing_clusters
In [169]:
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary', 
    'verbosity': 1
}
In [258]:
features.append('cluster')
In [214]:
folds = StratifiedKFold(n_splits=20, shuffle=False, random_state=42)
oof = np.zeros(len(mailout_train_clean))
predictions = np.zeros(len(mailout_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(mailout_train_clean.values, response.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(mailout_train_clean.iloc[trn_idx][features], label=response.iloc[trn_idx])
    val_data = lgb.Dataset(mailout_train_clean.iloc[val_idx][features], label=response.iloc[val_idx])

    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)
    oof[val_idx] = clf.predict(mailout_train_clean.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(mailout_test_clean[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(response, oof)))
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.945226	valid_1's auc: 0.834192
[2000]	training's auc: 0.973461	valid_1's auc: 0.828724
[3000]	training's auc: 0.987494	valid_1's auc: 0.807824
[4000]	training's auc: 0.994014	valid_1's auc: 0.802145
Early stopping, best iteration is:
[1165]	training's auc: 0.951003	valid_1's auc: 0.842026
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.943227	valid_1's auc: 0.751196
[2000]	training's auc: 0.973302	valid_1's auc: 0.754929
[3000]	training's auc: 0.987145	valid_1's auc: 0.75485
[4000]	training's auc: 0.99428	valid_1's auc: 0.749698
Early stopping, best iteration is:
[1687]	training's auc: 0.966135	valid_1's auc: 0.761396
Fold 2
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.944791	valid_1's auc: 0.646617
[2000]	training's auc: 0.973337	valid_1's auc: 0.675351
[3000]	training's auc: 0.987266	valid_1's auc: 0.70122
[4000]	training's auc: 0.993939	valid_1's auc: 0.707897
[5000]	training's auc: 0.99751	valid_1's auc: 0.713602
[6000]	training's auc: 0.998966	valid_1's auc: 0.716625
[7000]	training's auc: 0.999599	valid_1's auc: 0.724302
[8000]	training's auc: 0.999904	valid_1's auc: 0.724802
[9000]	training's auc: 0.99998	valid_1's auc: 0.72517
[10000]	training's auc: 0.999997	valid_1's auc: 0.726721
[11000]	training's auc: 1	valid_1's auc: 0.735265
[12000]	training's auc: 1	valid_1's auc: 0.733898
[13000]	training's auc: 1	valid_1's auc: 0.736579
[14000]	training's auc: 1	valid_1's auc: 0.736053
Early stopping, best iteration is:
[11264]	training's auc: 1	valid_1's auc: 0.734818
Fold 3
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.944666	valid_1's auc: 0.787844
[2000]	training's auc: 0.973356	valid_1's auc: 0.775593
[3000]	training's auc: 0.987503	valid_1's auc: 0.774252
Early stopping, best iteration is:
[172]	training's auc: 0.92448	valid_1's auc: 0.813581
Fold 4
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.942127	valid_1's auc: 0.792208
[2000]	training's auc: 0.971174	valid_1's auc: 0.805379
[3000]	training's auc: 0.986264	valid_1's auc: 0.800568
Early stopping, best iteration is:
[200]	training's auc: 0.924366	valid_1's auc: 0.812845
Fold 5
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.943251	valid_1's auc: 0.785669
[2000]	training's auc: 0.972662	valid_1's auc: 0.796586
[3000]	training's auc: 0.98659	valid_1's auc: 0.807555
[4000]	training's auc: 0.994528	valid_1's auc: 0.803162
[5000]	training's auc: 0.997951	valid_1's auc: 0.802583
Early stopping, best iteration is:
[2459]	training's auc: 0.97934	valid_1's auc: 0.810027
Fold 6
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.944841	valid_1's auc: 0.693997
[2000]	training's auc: 0.972844	valid_1's auc: 0.726668
[3000]	training's auc: 0.985978	valid_1's auc: 0.730692
[4000]	training's auc: 0.99353	valid_1's auc: 0.742109
[5000]	training's auc: 0.997285	valid_1's auc: 0.738189
[6000]	training's auc: 0.998914	valid_1's auc: 0.73719
[7000]	training's auc: 0.999647	valid_1's auc: 0.738584
Early stopping, best iteration is:
[4180]	training's auc: 0.994504	valid_1's auc: 0.743739
Fold 7
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.944165	valid_1's auc: 0.746765
[2000]	training's auc: 0.972658	valid_1's auc: 0.776568
[3000]	training's auc: 0.986894	valid_1's auc: 0.789746
[4000]	training's auc: 0.994292	valid_1's auc: 0.799006
[5000]	training's auc: 0.997813	valid_1's auc: 0.787721
[6000]	training's auc: 0.999306	valid_1's auc: 0.780093
[7000]	training's auc: 0.999801	valid_1's auc: 0.77591
Early stopping, best iteration is:
[4000]	training's auc: 0.994292	valid_1's auc: 0.799006
Fold 8
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.943066	valid_1's auc: 0.813842
[2000]	training's auc: 0.973326	valid_1's auc: 0.818261
[3000]	training's auc: 0.986667	valid_1's auc: 0.809449
[4000]	training's auc: 0.993833	valid_1's auc: 0.803898
Early stopping, best iteration is:
[1960]	training's auc: 0.972149	valid_1's auc: 0.823259
Fold 9
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.946092	valid_1's auc: 0.689131
[2000]	training's auc: 0.973602	valid_1's auc: 0.683186
[3000]	training's auc: 0.987319	valid_1's auc: 0.672454
Early stopping, best iteration is:
[110]	training's auc: 0.915852	valid_1's auc: 0.731403
Fold 10
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.941611	valid_1's auc: 0.862742
[2000]	training's auc: 0.972175	valid_1's auc: 0.862374
[3000]	training's auc: 0.98688	valid_1's auc: 0.866872
[4000]	training's auc: 0.994104	valid_1's auc: 0.872711
[5000]	training's auc: 0.997706	valid_1's auc: 0.872448
[6000]	training's auc: 0.999244	valid_1's auc: 0.86845
[7000]	training's auc: 0.999802	valid_1's auc: 0.864162
Early stopping, best iteration is:
[4213]	training's auc: 0.995118	valid_1's auc: 0.877262
Fold 11
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.942376	valid_1's auc: 0.874579
[2000]	training's auc: 0.970888	valid_1's auc: 0.873001
[3000]	training's auc: 0.986334	valid_1's auc: 0.866504
Early stopping, best iteration is:
[823]	training's auc: 0.933203	valid_1's auc: 0.883102
Fold 12
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.94437	valid_1's auc: 0.788799
[2000]	training's auc: 0.972077	valid_1's auc: 0.784512
[3000]	training's auc: 0.986498	valid_1's auc: 0.786195
[4000]	training's auc: 0.994029	valid_1's auc: 0.780461
Early stopping, best iteration is:
[1751]	training's auc: 0.966074	valid_1's auc: 0.791982
Fold 13
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.943141	valid_1's auc: 0.765941
[2000]	training's auc: 0.97259	valid_1's auc: 0.764205
[3000]	training's auc: 0.98673	valid_1's auc: 0.760022
Early stopping, best iteration is:
[799]	training's auc: 0.934747	valid_1's auc: 0.782513
Fold 14
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.94207	valid_1's auc: 0.80061
[2000]	training's auc: 0.972126	valid_1's auc: 0.775726
[3000]	training's auc: 0.98663	valid_1's auc: 0.76452
Early stopping, best iteration is:
[929]	training's auc: 0.938005	valid_1's auc: 0.803425
Fold 15
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.943941	valid_1's auc: 0.705082
[2000]	training's auc: 0.973196	valid_1's auc: 0.709518
[3000]	training's auc: 0.987314	valid_1's auc: 0.708003
[4000]	training's auc: 0.99404	valid_1's auc: 0.701775
[5000]	training's auc: 0.997665	valid_1's auc: 0.697724
Early stopping, best iteration is:
[2680]	training's auc: 0.983713	valid_1's auc: 0.714506
Fold 16
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.942222	valid_1's auc: 0.826968
[2000]	training's auc: 0.972439	valid_1's auc: 0.823743
[3000]	training's auc: 0.986264	valid_1's auc: 0.81137
Early stopping, best iteration is:
[899]	training's auc: 0.937367	valid_1's auc: 0.832396
Fold 17
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.941253	valid_1's auc: 0.747933
[2000]	training's auc: 0.972121	valid_1's auc: 0.771881
[3000]	training's auc: 0.986792	valid_1's auc: 0.777227
[4000]	training's auc: 0.993916	valid_1's auc: 0.777172
[5000]	training's auc: 0.997511	valid_1's auc: 0.771192
Early stopping, best iteration is:
[2800]	training's auc: 0.984613	valid_1's auc: 0.783151
Fold 18
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.942981	valid_1's auc: 0.716518
[2000]	training's auc: 0.972117	valid_1's auc: 0.716325
[3000]	training's auc: 0.986494	valid_1's auc: 0.71897
[4000]	training's auc: 0.992913	valid_1's auc: 0.721065
[5000]	training's auc: 0.997051	valid_1's auc: 0.71142
[6000]	training's auc: 0.998586	valid_1's auc: 0.706322
Early stopping, best iteration is:
[3174]	training's auc: 0.988043	valid_1's auc: 0.727596
Fold 19
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.942062	valid_1's auc: 0.809882
[2000]	training's auc: 0.972909	valid_1's auc: 0.823468
[3000]	training's auc: 0.987286	valid_1's auc: 0.805859
Early stopping, best iteration is:
[131]	training's auc: 0.926739	valid_1's auc: 0.84127
CV score: 0.68300 
In [213]:
print("CV score: {:<8.5f}".format(roc_auc_score(response, oof)))
CV score: 0.68805 
In [237]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
In [220]:
mailout_test_clean['RESPONSE'] = predictions
In [225]:
mailout_test_clean.rename(columns={'pred': 'RESPONSE'}, inplace= True)
In [227]:
(mailout_test_clean[['LNR', 'RESPONSE']]).to_csv('x.csv', index= False)